Document scoring, term weight computation, and the vector space model in information retrieval

1. Main content: When the document collection is large, the number of results that satisfy a Boolean query can be enormous, often far more than a user can browse, so the retrieved documents need to be scored and ranked. This section covers:
① the concepts of parametric indexes and zone indexes. Goals: (1) documents can be indexed and retrieved by their metadata (author, title, publication date, etc.); (2) these indexes support a simple form of document scoring;
② the concept of a term's weight within a document, and how to compute that weight from term statistics;
③ representing each document as a vector of the weights computed above, which makes it possible to compute the similarity between the query and each document (the vector space model);
④ variants of the weighting schemes used in the vector space model.

2. Parametric indexes and zone indexes:
① Metadata: data in specific forms associated with a document, such as the document's author, title, and publication date.
② Fields: metadata consists of fields, such as the document's creation date, its format, and its author. The values a field can take are usually limited, a bit like a column (attribute) of a database table.
③ Parametric index: there is one parametric index per field (for example, the document's creation time); through it we select only those documents that satisfy the query on that field.
④ Zone index (rendered as "domain index" in some translations): a zone can consist of arbitrary, unbounded amounts of free text; for example, the title and the abstract of a document can each be treated as a zone, and different zones of a document can be indexed separately. In a parametric index the dictionary usually comes from a fixed vocabulary, whereas in a zone index the dictionary comes from all the words of the free text in that zone. A minimal sketch of both structures follows.
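To make the distinction concrete, here is a sketch in Python, assuming a hypothetical two-document corpus with "author" and "date" as fields and "title" and "body" as zones (all names and contents are illustrative, not from the original notes):

    from collections import defaultdict

    # Hypothetical toy corpus: fixed-value fields plus free-text zones.
    docs = {
        1: {"author": "shakespeare", "date": "1601",
            "title": "hamlet prince of denmark", "body": "to be or not to be"},
        2: {"author": "shakespeare", "date": "1606",
            "title": "macbeth", "body": "fair is foul and foul is fair"},
    }

    # Parametric index: one index per field, mapping field value -> doc IDs.
    parametric = defaultdict(lambda: defaultdict(set))
    # Zone index: one inverted index per zone, mapping term -> doc IDs.
    zone = defaultdict(lambda: defaultdict(set))

    for doc_id, d in docs.items():
        for field in ("author", "date"):      # fields: limited value sets
            parametric[field][d[field]].add(doc_id)
        for z in ("title", "body"):           # zones: arbitrary free text
            for term in d[z].split():
                zone[z][term].add(doc_id)

    # Select by a field value, then restrict by a term in a zone.
    by_author = parametric["author"]["shakespeare"]   # {1, 2}
    in_title = zone["title"]["macbeth"]               # {2}
    print(by_author & in_title)                       # {2}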

(Figure 6-2, the zone index after encoding, is not reproduced here.)

(Figure 6-3 is not reproduced here.)
⑤ Weighted zone scoring: given a set of documents, assume each document has ℓ zones with corresponding weights g_1, g_2, ..., g_ℓ ∈ [0, 1] that sum to 1. Let s_i be the match score between the query and the i-th zone of the document (1 for a match, 0 for no match). The weighted zone score is then defined as

    score(q, d) = Σ_{i=1}^{ℓ} g_i · s_i

This method is also known as ranked Boolean retrieval; the key is how each document's score is computed (see the first sketch after this list).
⑥ Weight learning: the zone weights g can be learned from training examples Φ_j, each consisting of a (document, query) pair (d_j, q_j) together with a relevance judgment r(d_j, q_j). The error on a training example is

    ε(g, Φ_j) = (r(d_j, q_j) − score(d_j, q_j))², with g ∈ [0, 1]

and the weights are chosen to minimize the total error over all training examples. This is a quadratic programming problem; see Chapter 5 of Mathematical Programming, the section on constrained programming.
⑦ Term frequency and weight computation:
Question: are all the terms in a document equally important?
Core idea: assign a low weight to terms with a high document frequency (df_t, the number of documents in which term t appears). The more widely a term occurs, the less informative it is; stop words are the extreme case. This is captured by the inverse document frequency:

    idf_t = log(N / df_t), where N is the total number of documents.

Term frequency: the number of occurrences of term t in document d, written tf_{t,d}.
Weight computation: tf-idf_{t,d} = tf_{t,d} × idf_t (see the second sketch after this list).
(1) When t occurs many times in only a few documents, its weight is highest; such terms give those documents the strongest discriminating power.
(2) When t occurs only a few times in a document, or occurs in many documents, its weight is lower, and it contributes less to the final relevance computation.
(3) If t occurs in all documents, its weight is minimal.
⑧ Vector space model (VSM): the foundation of a whole family of information retrieval techniques, such as document scoring and the classification and clustering of documents. (See Chapter 6 of Programming Collective Intelligence on document filtering, and Chapter 4 of Machine Learning in Action, "Classifying with probability theory: naive Bayes," on building word vectors from text, p. 58.) Cosine similarity defines the similarity of two documents:

    sim(d_1, d_2) = V(d_1) · V(d_2) / (|V(d_1)| |V(d_2)|)

where V(d) is the weight vector of document d.
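First sketch: a minimal Python illustration of weighted zone scoring. The three zones, their weights g_i, and the document are all hypothetical; s_i is 1 exactly when the query term occurs in zone i.

    # Hypothetical zones and weights g_i (they sum to 1).
    weights = {"title": 0.5, "abstract": 0.3, "body": 0.2}

    doc = {
        "title": "information retrieval scoring",
        "abstract": "weighted zone scoring for ranked boolean retrieval",
        "body": "each zone contributes its weight when the query matches",
    }

    def weighted_zone_score(query_term, doc, weights):
        # score(q, d) = sum over zones of g_i * s_i
        return sum(g * (1 if query_term in doc[z].split() else 0)
                   for z, g in weights.items())

    print(weighted_zone_score("scoring", doc, weights))  # 0.5 + 0.3 = 0.8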
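Second sketch: tf-idf weights computed over a hypothetical three-document corpus, directly applying idf_t = log(N / df_t) and tf-idf_{t,d} = tf_{t,d} × idf_t from above.

    import math
    from collections import Counter

    # Hypothetical toy corpus.
    corpus = {
        "d1": "the cat sat on the mat",
        "d2": "the dog sat on the log",
        "d3": "cats and dogs",
    }
    N = len(corpus)
    tf = {d: Counter(text.split()) for d, text in corpus.items()}

    # df_t: number of documents containing term t.
    df = Counter(term for counts in tf.values() for term in counts)

    def tf_idf(term, doc_id):
        # tf-idf_{t,d} = tf_{t,d} * log(N / df_t)
        return tf[doc_id][term] * math.log(N / df[term])

    print(tf_idf("the", "d1"))  # common term: idf = log(3/2), low weight
    print(tf_idf("cat", "d1"))  # rarer term: idf = log(3/1), higher weight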

(Figure 6-12 gives an example; it is not reproduced here.)
⑨ Query vectors: the query itself can be represented as a vector in the same term space as the documents, and each document is then scored by the cosine similarity between the query vector and the document vector:

    score(q, d) = V(q) · V(d) / (|V(q)| |V(d)|)
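A minimal sketch of scoring documents against a query by cosine similarity, reusing the tf-idf definitions above on a hypothetical two-document corpus (vectors are kept as dicts, one entry per term that appears in the corpus):

    import math
    from collections import Counter

    corpus = {
        "d1": "the cat sat on the mat",
        "d2": "the dog sat on the log",
    }
    query = "cat mat"
    N = len(corpus)

    tf = {d: Counter(t.split()) for d, t in corpus.items()}
    df = Counter(term for c in tf.values() for term in c)

    def vec(counts):
        # tf-idf vector as a dict; query terms absent from the corpus are dropped.
        return {t: n * math.log(N / df[t]) for t, n in counts.items() if t in df}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(Counter(query.split()))
    for d in corpus:
        print(d, cosine(q, vec(tf[d])))  # d1 scores higher: it contains both terms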

(Example 6-4 is not reproduced here.)


⑩ The algorithm for computing vector similarities is shown in Figure 6-14 (not reproduced here). For its final step, extracting the K highest scores can be done with a heap; see Introduction to Algorithms, Chapter 19, on binomial heaps.
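A minimal sketch of the idea, assuming a hypothetical inverted index of precomputed per-document term weights: scores are accumulated per document, term at a time, and heapq handles the heap-based top-K selection. (Normalization by document length is omitted in this sketch.)

    import heapq
    from collections import defaultdict

    # Hypothetical postings: term -> list of (doc_id, weight of term in doc).
    postings = {
        "cat": [(1, 0.7), (3, 0.2)],
        "mat": [(1, 0.5), (2, 0.4)],
    }

    def cosine_score(query_terms, postings, k=2):
        scores = defaultdict(float)           # one accumulator per candidate doc
        for term in query_terms:
            for doc_id, weight in postings.get(term, []):
                scores[doc_id] += weight      # add this term's contribution
        # Pull out the K highest-scoring documents with a heap.
        return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

    print(cosine_score(["cat", "mat"], postings))  # [(1, 1.2), (2, 0.4)]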
There are other ways to compute tf-idf weights, such as sublinear tf scaling, wf_{t,d} = 1 + log(tf_{t,d}) if tf_{t,d} > 0 and 0 otherwise, and normalization by the maximum tf in the document, ntf_{t,d} = a + (1 − a) · tf_{t,d} / tf_max(d), where a is a smoothing constant (typically around 0.4).
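A short sketch of both variants, taking a = 0.4 as a typical (assumed) choice of smoothing constant:

    import math

    def wf(tf):
        # Sublinear tf scaling: 1 + log(tf) for tf > 0, else 0.
        return 1 + math.log(tf) if tf > 0 else 0.0

    def ntf(tf, tf_max, a=0.4):
        # Maximum tf normalization with smoothing constant a.
        return a + (1 - a) * tf / tf_max

    print(wf(10))      # ~3.30: dampens large raw counts
    print(ntf(5, 10))  # 0.7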
Problem: in practice, computing similarities means taking inner products of vectors with tens of thousands of dimensions, which is computationally expensive.
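One standard mitigation (not spelled out in the original notes, so treat this as an assumption) is to exploit sparsity: store only the nonzero components of each vector, so the cost of an inner product scales with the number of shared nonzero terms rather than with the vocabulary size.

    def sparse_dot(u, v):
        # Iterate over the smaller vector and probe the larger one:
        # cost is proportional to min(len(u), len(v)), not the vocabulary size.
        if len(u) > len(v):
            u, v = v, u
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    u = {"cat": 0.7, "mat": 0.5}   # nonzero components only
    v = {"mat": 0.4, "dog": 0.9}
    print(sparse_dot(u, v))        # 0.2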
