Document scoring, term weight computation, and the vector space model in information retrieval

1. Main content: When the document collection is large, the number of results that satisfy a Boolean query can be enormous, often far more than a user can browse, so the retrieved documents need to be scored and ranked. This section covers:
① the concepts of parametric indexes and zone indexes. Goals: (1) documents can be indexed and retrieved by their metadata (author, title, publication date, etc.); (2) these indexes support a simple form of document scoring;
② the concept of a term's weight within a document, and how to compute that weight from term statistics;
③ representing each document as a vector of the weights computed above, which makes it possible to compute the similarity between the query and each document (the vector space model);
④ variants of the weighting schemes used in the vector space model.

2. Parametric indexes and zone indexes:
① Metadata: data in specific forms associated with a document, such as the document's author, title, and publication date.
② Fields: metadata consists of fields, such as the document's creation date, its format, and its author. The values a field can take are usually limited, a bit like a column (attribute) of a database table.
③ Parametric index: there is one parametric index per field (for example, the document's creation time); through it we select only those documents that satisfy the query on that field.
④ Zone index (rendered as "domain index" in some translations): a zone can consist of arbitrary, unbounded amounts of free text; for example, the title and the abstract of a document can each be treated as a zone, and different zones of a document can be indexed separately. In a parametric index the dictionary usually comes from a fixed vocabulary, whereas in a zone index the dictionary comes from all the words of the free text in that zone. A minimal sketch of both structures follows.
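To make the distinction concrete, here is a sketch in Python, assuming a hypothetical two-document corpus with "author" and "date" as fields and "title" and "body" as zones (all names and contents are illustrative, not from the original notes):

    from collections import defaultdict

    # Hypothetical toy corpus: fixed-value fields plus free-text zones.
    docs = {
        1: {"author": "shakespeare", "date": "1601",
            "title": "hamlet prince of denmark", "body": "to be or not to be"},
        2: {"author": "shakespeare", "date": "1606",
            "title": "macbeth", "body": "fair is foul and foul is fair"},
    }

    # Parametric index: one index per field, mapping field value -> doc IDs.
    parametric = defaultdict(lambda: defaultdict(set))
    # Zone index: one inverted index per zone, mapping term -> doc IDs.
    zone = defaultdict(lambda: defaultdict(set))

    for doc_id, d in docs.items():
        for field in ("author", "date"):      # fields: limited value sets
            parametric[field][d[field]].add(doc_id)
        for z in ("title", "body"):           # zones: arbitrary free text
            for term in d[z].split():
                zone[z][term].add(doc_id)

    # Select by a field value, then restrict by a term in a zone.
    by_author = parametric["author"]["shakespeare"]   # {1, 2}
    in_title = zone["title"]["macbeth"]               # {2}
    print(by_author & in_title)                       # {2}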

(Figure 6-2, the zone index after encoding, is not reproduced here.)

(Figure 6-3 is not reproduced here.)
⑤ Weighted zone scoring: given a set of documents, assume each document has ℓ zones with corresponding weights g_1, g_2, ..., g_ℓ ∈ [0, 1] that sum to 1. Let s_i be the match score between the query and the i-th zone of the document (1 for a match, 0 for no match). The weighted zone score is then defined as

    score(q, d) = Σ_{i=1}^{ℓ} g_i · s_i

This method is also known as ranked Boolean retrieval; the key is how each document's score is computed (see the first sketch after this list).
⑥ Weight learning: the zone weights g can be learned from training examples Φ_j, each consisting of a (document, query) pair (d_j, q_j) together with a relevance judgment r(d_j, q_j). The error on a training example is

    ε(g, Φ_j) = (r(d_j, q_j) − score(d_j, q_j))², with g ∈ [0, 1]

and the weights are chosen to minimize the total error over all training examples. This is a quadratic programming problem; see Chapter 5 of Mathematical Programming, the section on constrained programming.
⑦ Term frequency and weight computation:
Question: are all the terms in a document equally important?
Core idea: assign a low weight to terms with a high document frequency (df_t, the number of documents in which term t appears). The more widely a term occurs, the less informative it is; stop words are the extreme case. This is captured by the inverse document frequency:

    idf_t = log(N / df_t), where N is the total number of documents.

Term frequency: the number of occurrences of term t in document d, written tf_{t,d}.
Weight computation: tf-idf_{t,d} = tf_{t,d} × idf_t (see the second sketch after this list).
(1) When t occurs many times in only a few documents, its weight is highest; such terms give those documents the strongest discriminating power.
(2) When t occurs only a few times in a document, or occurs in many documents, its weight is lower, and it contributes less to the final relevance computation.
(3) If t occurs in all documents, its weight is minimal.
⑧ Vector space model (VSM): the foundation of a whole family of information retrieval techniques, such as document scoring and the classification and clustering of documents. (See Chapter 6 of Programming Collective Intelligence on document filtering, and Chapter 4 of Machine Learning in Action, "Classifying with probability theory: naive Bayes," on building word vectors from text, p. 58.) Cosine similarity defines the similarity of two documents:

    sim(d_1, d_2) = V(d_1) · V(d_2) / (|V(d_1)| |V(d_2)|)

where V(d) is the weight vector of document d.
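First sketch: a minimal Python illustration of weighted zone scoring. The three zones, their weights g_i, and the document are all hypothetical; s_i is 1 exactly when the query term occurs in zone i.

    # Hypothetical zones and weights g_i (they sum to 1).
    weights = {"title": 0.5, "abstract": 0.3, "body": 0.2}

    doc = {
        "title": "information retrieval scoring",
        "abstract": "weighted zone scoring for ranked boolean retrieval",
        "body": "each zone contributes its weight when the query matches",
    }

    def weighted_zone_score(query_term, doc, weights):
        # score(q, d) = sum over zones of g_i * s_i
        return sum(g * (1 if query_term in doc[z].split() else 0)
                   for z, g in weights.items())

    print(weighted_zone_score("scoring", doc, weights))  # 0.5 + 0.3 = 0.8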
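Second sketch: tf-idf weights computed over a hypothetical three-document corpus, directly applying idf_t = log(N / df_t) and tf-idf_{t,d} = tf_{t,d} × idf_t from above.

    import math
    from collections import Counter

    # Hypothetical toy corpus.
    corpus = {
        "d1": "the cat sat on the mat",
        "d2": "the dog sat on the log",
        "d3": "cats and dogs",
    }
    N = len(corpus)
    tf = {d: Counter(text.split()) for d, text in corpus.items()}

    # df_t: number of documents containing term t.
    df = Counter(term for counts in tf.values() for term in counts)

    def tf_idf(term, doc_id):
        # tf-idf_{t,d} = tf_{t,d} * log(N / df_t)
        return tf[doc_id][term] * math.log(N / df[term])

    print(tf_idf("the", "d1"))  # common term: idf = log(3/2), low weight
    print(tf_idf("cat", "d1"))  # rarer term: idf = log(3/1), higher weight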

(Figure 6-12 gives an example; it is not reproduced here.)
⑨ Query vectors: the query itself can be represented as a vector in the same term space as the documents, and each document is then scored by the cosine similarity between the query vector and the document vector:

    score(q, d) = V(q) · V(d) / (|V(q)| |V(d)|)
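A minimal sketch of scoring documents against a query by cosine similarity, reusing the tf-idf definitions above on a hypothetical two-document corpus (vectors are kept as dicts, one entry per term that appears in the corpus):

    import math
    from collections import Counter

    corpus = {
        "d1": "the cat sat on the mat",
        "d2": "the dog sat on the log",
    }
    query = "cat mat"
    N = len(corpus)

    tf = {d: Counter(t.split()) for d, t in corpus.items()}
    df = Counter(term for c in tf.values() for term in c)

    def vec(counts):
        # tf-idf vector as a dict; query terms absent from the corpus are dropped.
        return {t: n * math.log(N / df[t]) for t, n in counts.items() if t in df}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(Counter(query.split()))
    for d in corpus:
        print(d, cosine(q, vec(tf[d])))  # d1 scores higher: it contains both terms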

(Example 6-4 is not reproduced here.)


⑩ The algorithm for computing vector similarities is shown in Figure 6-14 (not reproduced here). For its final step, extracting the K highest scores can be done with a heap; see Introduction to Algorithms, Chapter 19, on binomial heaps.
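A minimal sketch of the idea, assuming a hypothetical inverted index of precomputed per-document term weights: scores are accumulated per document, term at a time, and heapq handles the heap-based top-K selection. (Normalization by document length is omitted in this sketch.)

    import heapq
    from collections import defaultdict

    # Hypothetical postings: term -> list of (doc_id, weight of term in doc).
    postings = {
        "cat": [(1, 0.7), (3, 0.2)],
        "mat": [(1, 0.5), (2, 0.4)],
    }

    def cosine_score(query_terms, postings, k=2):
        scores = defaultdict(float)           # one accumulator per candidate doc
        for term in query_terms:
            for doc_id, weight in postings.get(term, []):
                scores[doc_id] += weight      # add this term's contribution
        # Pull out the K highest-scoring documents with a heap.
        return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

    print(cosine_score(["cat", "mat"], postings))  # [(1, 1.2), (2, 0.4)]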
There are other ways to compute tf-idf weights, such as sublinear tf scaling, wf_{t,d} = 1 + log(tf_{t,d}) if tf_{t,d} > 0 and 0 otherwise, and normalization by the maximum tf in the document, ntf_{t,d} = a + (1 − a) · tf_{t,d} / tf_max(d), where a is a smoothing constant (typically around 0.4).
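A short sketch of both variants, taking a = 0.4 as a typical (assumed) choice of smoothing constant:

    import math

    def wf(tf):
        # Sublinear tf scaling: 1 + log(tf) for tf > 0, else 0.
        return 1 + math.log(tf) if tf > 0 else 0.0

    def ntf(tf, tf_max, a=0.4):
        # Maximum tf normalization with smoothing constant a.
        return a + (1 - a) * tf / tf_max

    print(wf(10))      # ~3.30: dampens large raw counts
    print(ntf(5, 10))  # 0.7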
Problem: in practice, computing similarities means taking inner products of vectors with tens of thousands of dimensions, which is computationally expensive.
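One standard mitigation (not spelled out in the original notes, so treat this as an assumption) is to exploit sparsity: store only the nonzero components of each vector, so the cost of an inner product scales with the number of shared nonzero terms rather than with the vocabulary size.

    def sparse_dot(u, v):
        # Iterate over the smaller vector and probe the larger one:
        # cost is proportional to min(len(u), len(v)), not the vocabulary size.
        if len(u) > len(v):
            u, v = v, u
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    u = {"cat": 0.7, "mat": 0.5}   # nonzero components only
    v = {"mat": 0.4, "dog": 0.9}
    print(sparse_dot(u, v))        # 0.2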
