I. Ranked Retrieval
In the previous section we discussed Boolean queries, where a document either matches or does not. When many documents match, we need to rank (sort) the results.
II. Parametric and Zone Indexes
Besides its text, a document has metadata such as its creation time and title, so we can also restrict results by these fields, e.g. requiring that matching documents were published in 2010.
Parametric index: the field's values are constrained, for example to a fixed range; a date field is a typical parametric index.
Zone index: the field's values are unconstrained, such as the title, which can be arbitrary text.
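As a minimal sketch of a parametric restriction, the following filters Boolean match results by a metadata field. The document ids, field names, and values are all illustrative, not from the text:

```python
# Sketch: combining a Boolean result list with a parametric (date) restriction.
# doc_meta maps doc id -> metadata; all values here are illustrative.
doc_meta = {
    1: {"year": 2009, "title": "intro to retrieval"},
    2: {"year": 2010, "title": "ranked retrieval"},
    3: {"year": 2010, "title": "boolean queries"},
}

def restrict_by_year(doc_ids, year):
    """Keep only documents whose parametric 'year' field equals `year`."""
    return [d for d in doc_ids if doc_meta[d]["year"] == year]

print(restrict_by_year([1, 2, 3], 2010))  # -> [2, 3]
```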
III. Weighted Zone Scoring
A document has zones f1, f2, f3, and each zone carries a different weight.
Each zone has a weight w1, w2, w3; the weights are determined by machine learning from a training set.
s(q, d) = w1·s(q, f1) + w2·s(q, f2) + w3·s(q, f3)
So s(q, d) is the weighted sum of the per-zone scores between the query and the document.
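The formula above can be sketched as follows for a one-term query. The per-zone match score here is a simple 0/1 occurrence test, and the zone names and weights are illustrative:

```python
# Weighted zone scoring sketch for a one-term query:
#   s(q, d) = w1*s(q, f1) + w2*s(q, f2) + w3*s(q, f3)
# Here s(q, f_i) is 1.0 if the term occurs in zone f_i and 0.0 otherwise.
def zone_match(term, zone_text):
    return 1.0 if term in zone_text.split() else 0.0

def weighted_zone_score(term, doc, weights):
    return sum(w * zone_match(term, doc[zone]) for zone, w in weights.items())

doc = {
    "title": "ranked retrieval",
    "body": "boolean queries return unranked matches",
    "anchor": "retrieval",
}
weights = {"title": 0.5, "body": 0.2, "anchor": 0.3}  # illustrative; sum to 1
print(round(weighted_zone_score("retrieval", doc, weights), 2))  # 0.5 + 0.3 = 0.8
```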
IV. Term Frequency
For a given query, we iterate over its terms, compute each term's score in the document separately, and add these per-term scores to obtain the query's score for the document.
The weight of a term in a document is the number of times the term appears in it.
Bag-of-words model: the order of the terms in the document is ignored; only the number of occurrences matters.
This is a great improvement over the Boolean retrieval model.
TF: the number of occurrences of a term in a document.
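A minimal sketch of computing TF as a bag-of-words count, using Python's `Counter` (the sample sentence is illustrative):

```python
from collections import Counter

# Bag-of-words: discard term order, keep only occurrence counts (TF).
def term_frequencies(text):
    return Counter(text.lower().split())

tf = term_frequencies("to be or not to be")
print(tf["to"], tf["be"], tf["not"])  # 2 2 1
```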
If a term's weight depends only on its occurrence count, that can be problematic: a common word such as "the" may appear many times in a document, giving it a very high TF, even though the word is not important at all.
For convenience, the TF value is generally stored in the postings of the inverted index, while the IDF value is stored in the dictionary.
Relying on TF alone as the weight assumes that every term is equally important to the document.
So we need to introduce additional weighting schemes.
V. Inverse Document Frequency
DF: the number of documents in which a term appears.
IDF: inversely related to DF; the more documents a term appears in, the lower its IDF value.
CF: the total number of times a term appears in the whole document collection.
idf_t = log(N / df_t), where N is the number of documents in the collection.
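As a worked example of idf_t = log(N / df_t), assuming base-10 logarithms and N = 1,000,000 documents; the document-frequency counts below are illustrative, chosen so the idf values match the figures used later in this section:

```python
import math

# idf_t = log10(N / df_t); N and the df counts below are illustrative.
N = 1_000_000
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
idf = {t: math.log10(N / d) for t, d in df.items()}
print({t: round(v, 1) for t, v in idf.items()})
# {'auto': 2.3, 'best': 1.3, 'car': 2.0, 'insurance': 3.0}
```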
Therefore, we combine TF and IDF into TF-IDF for the weight calculation.
The score of a document is the sum of the TF-IDF scores of each query term for that document.
Document scoring method:
Given a query and a document collection, we need to find the documents in the collection that are relevant to the query.
We split the query into a set of terms; for each term we compute its TF-IDF value for a document, and the sum of all these TF-IDF values is the document's score.
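The scoring method above can be sketched as follows, using a toy three-document collection, raw counts for TF, and base-10 logs for IDF (all names and texts are illustrative):

```python
import math
from collections import Counter

# score(q, d) = sum over query terms t of tf(t, d) * idf(t)
docs = ["the car is fast", "the car insurance is cheap", "best insurance rates"]
N = len(docs)
tfs = [Counter(d.split()) for d in docs]

def idf(term):
    df = sum(1 for tf in tfs if term in tf)  # document frequency of the term
    return math.log10(N / df) if df else 0.0

def score(query, doc_id):
    return sum(tfs[doc_id][t] * idf(t) for t in query.split())

best_doc = max(range(N), key=lambda i: score("car insurance", i))
print(best_doc)  # document 1 contains both query terms
```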
VI. Vector Space Model (VSM)
Main idea: represent each document as a vector of term weights.
This gives a term-document matrix, where each entry is the weight of a particular term in a particular document, indicating how important that term is to the document.
Euclidean normalization: given a vector (n1, n2, n3, n4), normalize it to (n1/n, n2/n, n3/n, n4/n), where n = sqrt(n1^2 + n2^2 + n3^2 + n4^2) is its Euclidean length.
1. Similarity between two documents
0. Represent each document as a vector.
1. Normalize the two document vectors separately.
2. Take the dot product of the two vectors.
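The three steps above can be sketched with term-count vectors (the sample texts are illustrative):

```python
import math
from collections import Counter

def normalize(vec):
    """Euclidean normalization: divide each component by the vector's length."""
    n = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / n for t, v in vec.items()}

def similarity(doc_a, doc_b):
    a = normalize(Counter(doc_a.split()))  # steps 0 and 1 for document a
    b = normalize(Counter(doc_b.split()))  # steps 0 and 1 for document b
    return sum(w * b.get(t, 0.0) for t, w in a.items())  # step 2: dot product

print(round(similarity("car insurance", "car insurance rates"), 3))  # 0.816
```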
With this method, a user can find web pages similar to a given web page.
2. Query Vector
The query is itself a vector, and the document collection is represented as a set of document vectors.
In the query vector, each value represents the importance (weight) of the corresponding term in the query; the weights can be TF, IDF, or TF-IDF values.
For example:
The query weights are TF-IDF values, normalized.
The document vector weights are TF values, normalized.
Finally, the dot product of the two vectors is the score s(q, d).
In the example, computing the weights this way gives the query vector (0, 1.3, 2.0, 3.0).
Note: the query vector's weights depend on the chosen weighting scheme. For the query "auto car", the vector is not necessarily (1, 0, 1, 0): if TF is not used and only IDF is, each term is weighted by its IDF value, giving (2.3, 1.3, 2.0, 3.0).
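A sketch of this query-document score, assuming the vocabulary (auto, best, car, insurance) with the idf values quoted above, a hypothetical query "best car insurance", and a hypothetical document TF vector. For simplicity the query vector is left unnormalized here; normalizing it would scale every document's score by the same constant and not change the ranking:

```python
import math

terms = ["auto", "best", "car", "insurance"]  # vocabulary from the example
idf = [2.3, 1.3, 2.0, 3.0]                    # idf values from the example

def query_tfidf(query):
    """Query weights: tf(t, q) * idf(t) for each vocabulary term."""
    return [query.split().count(t) * idf[i] for i, t in enumerate(terms)]

def normalize(vec):
    n = math.sqrt(sum(v * v for v in vec))
    return [v / n for v in vec] if n else vec

q = query_tfidf("best car insurance")  # -> [0.0, 1.3, 2.0, 3.0]
d = normalize([1, 0, 1, 2])            # document TF vector, Euclidean-normalized
score = sum(qi * di for qi, di in zip(q, d))
print(round(score, 3))  # 3.266
```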