Introduction to Information Retrieval Study Notes (6): Document Scoring, Term Weighting, and the Vector Space Model

So far, we have treated a document as a sequence of terms. In fact, most documents carry additional structure. Digital documents usually encode metadata in machine-readable form; metadata is data about the document itself, such as its author, title, and publication date. Consider the query: "Find documents authored by William Shakespeare in 1601 that contain the phrase Alas poor Yorick." As usual, query processing requires merging postings lists; the difference is that it also involves merging over a parametric index. How should the index be structured to support such queries?

There are two common designs. The first encodes the zone in the dictionary: a separate dictionary entry (and index) is built for each zone, and query processing merges postings lists drawn from the different zone indexes. The second encodes the zone in the postings themselves, so that each posting records the zone in which the term occurred. With the dictionary approach, a query spanning several zones must merge several indexes, which is comparatively inefficient; zone information is therefore usually stored in the postings rather than in the dictionary. Another important reason for this encoding is that it supports weighted zone scoring.

Note the distinction between fields and zones: a field takes values from a relatively small set (an author name, a date), whereas a zone may contain an arbitrary amount of free text. For example, the title and abstract of a document are usually treated as zones. We can build a separate inverted index for each zone of a document. Given a Boolean query q and a document d, weighted zone scoring assigns each pair (q, d) a score in [0, 1].
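As a concrete sketch, weighted zone scoring for a Boolean AND query might be implemented as follows (the function name, zone names, and weights here are illustrative, not from the text):

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """doc_zones maps zone name -> set of terms in that zone;
    weights maps zone name -> g_i, with the g_i summing to 1."""
    score = 0.0
    for zone, g in weights.items():
        # Boolean per-zone score for an AND query: 1 iff every term matches
        s_i = 1.0 if all(t in doc_zones.get(zone, set()) for t in query_terms) else 0.0
        score += g * s_i
    return score

doc = {"title": {"alas", "poor", "yorick"}, "body": {"alas", "yorick"}}
weights = {"title": 0.7, "body": 0.3}
print(weighted_zone_score(["alas", "poor", "yorick"], doc, weights))  # 0.7
```

Only the title zone matches all three terms, so the score is the title weight alone.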
This score is a linear combination of per-zone scores, each of which is Boolean: either 0 or 1. More precisely, suppose each document has L zones with corresponding weights g1, ..., gL ∈ [0, 1] satisfying g1 + ... + gL = 1, and let si be the Boolean match score of the query against the i-th zone (1 for a match, 0 otherwise). For an AND query, si = 1 if all query terms appear in the zone and 0 otherwise. The weighted zone score is then:

score(q, d) = g1·s1 + g2·s2 + ... + gL·sL

Term frequency. So far we have only considered whether a term occurs in a zone of the document; we now consider how often it occurs. It stands to reason that the more frequently a term appears in a document or zone, the higher that document or zone should score. The term frequency tf(t, d) is the number of times term t appears in document d; the two arguments denote the term and the document respectively.

How should we compute the score? The simplest way is to score a document by the raw number of occurrences of the term (the raw term frequency tf). However, raw tf is unsuitable: if a term appears 10 times in document A (tf = 10) and once in document B (tf = 1), A is more relevant than B, but not 10 times more relevant. Relevance does not grow in proportion to term frequency. One replacement for raw tf is the logarithmic term frequency:

wf(t, d) = 1 + log tf(t, d) if tf(t, d) > 0, otherwise 0

where the added 1 smooths the value so that tf = 1 still yields a positive weight. Another variant normalizes by the maximum tf in the document and is called augmented normalized tf:

ntf(t, d) = a + (1 − a) · tf(t, d) / max(tf)

where a is a smoothing factor whose traditional empirical value is 0.5.
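The two tf variants above can be sketched in a few lines (function names are mine; the log base is conventionally 10):

```python
import math

def log_tf(tf):
    # Logarithmic term frequency: 1 + log10(tf) for tf > 0, else 0.
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def augmented_tf(tf, max_tf, a=0.4):
    # Maximum-tf-normalized ("augmented") tf; a is the smoothing factor
    # (0.5 historically, 0.4 in later studies).
    return a + (1.0 - a) * tf / max_tf

print(log_tf(10))           # 2.0 -- far less than the raw tf of 10
print(augmented_tf(3, 10))  # ≈ 0.58
```

Note how log_tf compresses a tenfold raw-frequency gap into a factor of two, matching the intuition that relevance is not proportional to term frequency.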
Later studies found that a value of 0.4 works better. In the formula, tf is the actual term frequency, while max(tf) is the frequency of the most frequent term in the document. The main purpose of this normalization is to offset the effect of long documents: in a long document, the tf values of all terms tend to be higher than in a short one, but that does not make long documents more relevant to the query. Dividing the actual term frequency by the maximum term frequency in the document normalizes the absolute counts, so the formula measures the relative importance of terms within the same document.

Inverse document frequency. Raw term frequency faces a serious problem: it treats all terms as equally important when computing relevance to a query. Clearly, a rare term carries more information than a common one, so when ranking search results we want to assign rare terms high weights and common terms low weights. To this end we introduce the document frequency df(t), the number of documents in the collection in which term t appears. Since rare terms carry more information than common ones, df is inversely related to the informativeness of term t. Because df itself is usually large, it needs to be mapped into a smaller range. Letting N be the total number of documents, the idf (inverse document frequency) of term t is defined as:

idf(t) = log(N / df(t))

idf is an important indicator of the informativeness of term t: a rare term tends to have a high idf, while a frequent term's idf may be low. Effects of idf on ranking:
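A minimal sketch of the idf computation, using hypothetical counts for a collection of one million documents:

```python
import math

def idf(N, df):
    # Inverse document frequency: idf(t) = log10(N / df(t)).
    return math.log10(N / df)

print(idf(1_000_000, 100))        # rare term       -> 4.0
print(idf(1_000_000, 1_000_000))  # ubiquitous term -> 0.0
```

A term occurring in every document gets idf 0 and thus contributes nothing to the ranking.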
  • idf affects the ranking only for queries containing at least two terms. For example, in the query "arachnocentric line", idf weighting increases the relative weight of arachnocentric and decreases the relative weight of line.
  • For one-term queries, idf has no effect on document ranking.
TF-IDF weighting. For each term in each document, tf and idf can be combined into a final weight. The tf-idf scheme assigns term t in document d the weight:

tf-idf(t, d) = tf(t, d) × idf(t)

In other words, tf-idf assigns weights to term t in document d as follows:
  1. highest when t occurs many times within a small number of documents (thus lending the strongest discriminating power to those documents);
  2. lower when t occurs few times in a document, or occurs in many documents (thus playing a smaller role in the final relevance calculation);
  3. lowest when t occurs in virtually all documents.
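Combining the two pieces gives a tf-idf sketch that reproduces the three cases above (the counts are hypothetical, and log-tf is used for the tf component):

```python
import math

def tf_idf(tf, N, df):
    # tf-idf weight: (1 + log10(tf)) * log10(N / df), 0 if the term is absent.
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(N / df)

# The three cases, for a hypothetical collection of 1000 documents:
print(tf_idf(10, 1000, 5))     # frequent in few docs      -> largest weight
print(tf_idf(1, 1000, 500))    # rare in doc / common term -> small weight
print(tf_idf(10, 1000, 1000))  # appears in all docs       -> 0.0
```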
Having covered these concepts, we can look at the BM25 algorithm, which is commonly used to score search relevance. In one sentence, the main idea is: parse the query into morphemes qi; then, for each search result d, compute the relevance score between each qi and d; finally, take the weighted sum of these scores as the relevance score between the query and d. (Aside: a morpheme is the smallest unit of language that pairs sound with meaning.) The general formula of BM25 is as follows:

Score(Q, d) = Σi Wi · R(qi, d)

where:
  • Q denotes the query;
  • qi denotes a morpheme obtained by parsing Q (for Chinese, we can treat the output of word segmentation as the morpheme analysis, taking each word as a morpheme qi);
  • d denotes a search result document;
  • Wi denotes the weight of morpheme qi;
  • R(qi, d) denotes the relevance score between morpheme qi and document d.
The weight Wi reflects how strongly a morpheme indicates the relevance of a document. Many definitions are possible; idf is the most common:

IDF(qi) = log((N − n(qi) + 0.5) / (n(qi) + 0.5))

where N is the number of documents in the index and n(qi) is the number of documents containing the morpheme qi. By this definition, the more documents in the collection contain qi, the lower the weight of qi: when many documents contain qi, it discriminates poorly, so it contributes little when judging relevance. (A stop word, for instance, appears very frequently but carries very little information.) The relevance score R(qi, d) in BM25 generally takes the form:

R(qi, d) = (fi · (k1 + 1)) / (fi + K) · (qfi · (k2 + 1)) / (qfi + k2)

K = k1 · (1 − b + b · dl / avgdl)

where:
  • k1, k2, and b are tuning factors, usually set from experience; commonly k1 = 2 and b = 0.75;
  • fi is the frequency of qi in d (that is, the term frequency), and qfi is the frequency of qi in the query;
  • dl is the length of document d, and avgdl is the average length of all documents. In most cases qi appears only once in the query, so qfi = 1.

In conclusion, the formula can be simplified (the qfi factor becomes 1) to:

Score(Q, d) = Σi IDF(qi) · (fi · (k1 + 1)) / (fi + k1 · (1 − b + b · dl / avgdl))

From the definition of K, we can see that the role of parameter b is to adjust the influence of document length on relevance: the larger b is, the more document length affects the relevance score. This can be understood as follows: a longer document has a greater chance of containing qi in the first place, so at equal fi, the relevance between a long document and qi should be weaker than that between a short document and qi.
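Putting the pieces together, here is a minimal BM25 sketch under the simplification above (qfi = 1, so the k2 factor drops out; the function and variable names are illustrative):

```python
import math

def bm25_score(query_terms, doc_terms, n_containing, N, avgdl, k1=2.0, b=0.75):
    """Simplified BM25 with qfi = 1.

    doc_terms:    list of tokens in the document d
    n_containing: term -> n(qi), number of documents containing the term
    N:            total number of documents in the index
    avgdl:        average document length over the collection
    """
    dl = len(doc_terms)
    K = k1 * (1.0 - b + b * dl / avgdl)
    score = 0.0
    for q in query_terms:
        n_q = n_containing.get(q, 0)
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))  # BM25's smoothed IDF
        f = doc_terms.count(q)                          # fi: tf of q in d
        score += idf * f * (k1 + 1.0) / (f + K)
    return score

doc = ["alas", "poor", "yorick", "alas"]
print(bm25_score(["alas", "yorick"], doc, {"alas": 2, "yorick": 1},
                 N=10, avgdl=4.0))
```

With b = 0.75, padding the same document with extra unrelated tokens lowers its score for the same fi, which is exactly the length normalization discussed above.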

Reference http://www.cnblogs.com/god_bless_you/archive/2012/08/20/2647730.html
