Solr Similarity Algorithm III: Introduction to the DFRSimilarity Framework

Source: Internet
Author: User
Tags: solr

Address: http://terrier.org/docs/v3.5/dfr_description.html

The divergence from randomness (DFR) paradigm is a generalisation of one of the very first models of information retrieval, Harter's 2-Poisson indexing model [1]. The 2-Poisson model is based on the hypothesis that the level of treatment of the informative words is witnessed by an elite set of documents, in which these words occur to a relatively greater extent than in the rest of the documents.

On the other hand, there are words which do not possess elite documents, and thus their frequency follows a random distribution, that is, the single Poisson model. Harter's model was first explored as a retrieval model by Robertson, Van Rijsbergen and Porter [4]. Successively it was combined with the standard probabilistic model by Robertson and Walker [3] and gave birth to the family of the BM IR models (among them the well-known BM25, which is at the basis of the Okapi system).

DFR models are obtained by instantiating the three components of the framework: selecting a basic randomness model, applying the first normalisation, and normalising the term frequencies.
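Conceptually (this is only an illustrative sketch, not Terrier's or Solr's actual code), the three components can be seen as three pluggable functions whose product gives the term weight; the concrete choices for each slot are the formulas introduced in the remainder of this page:

```python
# Structural sketch only: the names and signatures are illustrative assumptions,
# not an actual Terrier or Solr API.
def dfr_weight(tf, dl, basic_model, info_gain, tf_norm):
    """tf: raw term frequency in the document; dl: document length.
    tf_norm(tf, dl)  -> normalised term frequency tfn (term frequency normalisation);
    basic_model(tfn) -> -log2 Prob_M(tfn), the informative content (basic randomness model);
    info_gain(tfn)   -> the risk/information-gain factor (first normalisation)."""
    tfn = tf_norm(tf, dl)
    return info_gain(tfn) * basic_model(tfn)
```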

Basic Randomness Models

The DFR models are based on this simple idea: "the more the divergence of the within-document term frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words, the term weight is inversely related to the probability of the term frequency within the document d obtained by a model M of randomness:

    weight(t, d) ∝ -log2 Prob_M(t ∈ d | Collection)    (1)

where the subscript M stands for the type of model of randomness employed to compute the probability. In order to choose the appropriate model M of randomness, we can use different urn models. IR is thus seen as a probabilistic process, which uses random drawings from urn models, or equivalently random placement of coloured balls into urns. Instead of urns we have documents, and instead of different colours we have different terms, where each term occurs with some multiplicity in the urns as any one of a number of related words or phrases which are called tokens of the term. There are many ways to choose M, and each of these provides a basic DFR model. The basic models are listed in the following table.

Basic DFR Models

D      Divergence approximation of the binomial
P      Approximation of the binomial
BE     Bose-Einstein distribution
G      Geometric approximation of the Bose-Einstein
I(n)   Inverse Document Frequency model
I(F)   Inverse Term Frequency model
I(ne)  Inverse Expected Document Frequency model

If the model M is the binomial distribution, then the basic model is P and computes the value:

    -log2 Prob(tf) = -log2 [ (TF choose tf) · p^tf · q^(TF − tf) ]    (2)

where

    • TF is the term frequency of the term t in the collection
    • tf is the term frequency of the term t in the document d
    • N is the number of documents in the collection
    • p is 1/N and q = 1 − p
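As a concrete illustration of Formula (2), here is a small Python sketch with made-up statistics; the variable names follow the definitions above:

```python
from math import comb, log2

def basic_model_P(tf, TF, N):
    """-log2 of the binomial probability of observing tf occurrences of the term
    in one document, given TF occurrences distributed over N documents."""
    p = 1.0 / N
    q = 1.0 - p
    prob = comb(TF, tf) * (p ** tf) * (q ** (TF - tf))
    return -log2(prob)

# Made-up statistics: a term occurring 100 times in a 10,000-document collection,
# seen 3 times in the current document; the high value signals an informative term.
print(basic_model_P(tf=3, TF=100, N=10_000))
```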

Similarly, if the model M is the geometric distribution, then the basic model is G and computes the value:

    -log2 Prob(tf) = -log2 [ (1 / (1 + λ)) · (λ / (1 + λ))^tf ]    (3)

where λ = F/N, with F the term frequency of the term t in the whole collection.
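The corresponding sketch for the G model, assuming the geometric form above with λ = F/N:

```python
from math import log2

def basic_model_G(tf, F, N):
    """-log2 of a geometric distribution with mean F / N, i.e. the geometric
    approximation of the Bose-Einstein model in Formula (3)."""
    lam = F / N
    prob = (1.0 / (1.0 + lam)) * (lam / (1.0 + lam)) ** tf
    return -log2(prob)

print(basic_model_G(tf=3, F=100, N=10_000))  # same made-up statistics as above
```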

First normalisation

When a rare term does not occur in a document, it has almost zero probability of being informative for that document. On the contrary, if a rare term has many occurrences in a document, then it has a very high probability (almost the certainty) of being informative for the topic described by the document. Similarly to Ponte and Croft's [2] language model, we include a risk component in the DFR models. If the term frequency in the document is high, then the risk for the term of not being informative is minimal. In such a case Formula (1) gives a high value, but a minimal risk also has the negative effect of providing a small information gain. Therefore, instead of using the full weight provided by Formula (1), we tune or smooth the weight of Formula (1) by considering only the portion of it which is the amount of information gained with the term:

    gain(t, d) = P_risk · ( -log2 Prob_M(t ∈ d | Collection) )    (4)

The more the term occurs in the elite set, the less its term frequency is due to randomness, and thus the smaller the probability P_risk is, that is:

    P_risk = Prob_risk(tf | E_t)    (5)

where E_t is the elite set of the term t, i.e. P_risk is computed with respect to the term's distribution within its elite set.

We use two models for computing the information gain with a term within a document: the Laplace model L and the ratio of two Bernoulli processes, B:

    L:  P_risk = 1 / (tf + 1)
    B:  P_risk = (F + 1) / (df · (tf + 1))    (6)

where df is the number of documents containing the term and F is the term frequency of the term in the collection.
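A minimal sketch of the two information-gain factors; the B form below uses the (F + 1)/(df · (tf + 1)) ratio found in the Terrier and Lucene implementations, which may differ slightly in smoothing detail from the figure that originally accompanied Formula (6):

```python
def after_effect_L(tf):
    """Laplace law of succession: the risk factor shrinks as tf grows."""
    return 1.0 / (tf + 1.0)

def after_effect_B(tf, F, df):
    """Ratio of two Bernoulli processes; F is the collection frequency of the
    term, df the number of documents containing it."""
    return (F + 1.0) / (df * (tf + 1.0))

print(after_effect_L(3))                 # 0.25
print(after_effect_B(3, F=100, df=80))   # ~0.316
```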

Term Frequency normalisation

Before using Formula (4), the document length dl is normalised to a standard length sl. Consequently, the term frequencies tf are also recomputed with respect to the standard document length, that is:

    tfn = tf · (sl / dl)    (7)

A more flexible formula, referred to as Normalisation 2, is given below:

    tfn = tf · log2(1 + c · sl / dl)    (8)

where c is a tunable parameter [6, 7].
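A sketch of the two normalisations; here sl is taken to be the average document length, and c in Normalisation 2 is the free parameter studied in [6, 7]:

```python
from math import log2

def normalisation_1(tf, dl, sl):
    """Formula (7): rescale tf as if the document had the standard length sl."""
    return tf * (sl / dl)

def normalisation_2(tf, dl, sl, c=1.0):
    """Formula (8): a smoother, logarithmic rescaling controlled by c."""
    return tf * log2(1.0 + c * sl / dl)

print(normalisation_1(3, dl=500, sl=250))         # 1.5
print(normalisation_2(3, dl=500, sl=250, c=1.0))  # ~1.76
```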

DFR models are finally obtained from the generating Formula (4), using a basic DFR model (such as Formula (2) or (3)) in combination with a model of information gain (such as Formula (6)), and normalising the term frequency (as in Formula (7) or Formula (8)).
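Putting the pieces together, the following toy example wires one basic model (G), one information-gain model (Laplace L) and Normalisation 2 through the generating Formula (4). It only illustrates the composition; real Terrier and Lucene/Solr implementations add details such as query term weighting and boundary handling.

```python
from math import log2

def dfr_score_example(tf, dl, avg_dl, F, N, c=1.0):
    """Toy DFR score: Normalisation 2, then the geometric basic model (3),
    then the Laplace information gain (6), combined as in Formula (4)."""
    tfn = tf * log2(1.0 + c * avg_dl / dl)                          # Formula (8)
    lam = F / N
    inf1 = -log2((1.0 / (1.0 + lam)) * (lam / (1.0 + lam)) ** tfn)  # Formula (3)
    gain = 1.0 / (tfn + 1.0)                                        # Laplace L, Formula (6)
    return gain * inf1                                              # Formula (4)

# Made-up statistics: term seen 3 times in a 500-token document, 100 times in a
# 10,000-document collection whose average document length is 250 tokens.
print(dfr_score_example(tf=3, dl=500, avg_dl=250, F=100, N=10_000))
```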

DFR Models in Terrier

Included with Terrier are many of the DFR models, including:

Model     Description
BB2       Bernoulli-Einstein model with Bernoulli after-effect and Normalisation 2.
IFB2      Inverse Term Frequency model with Bernoulli after-effect and Normalisation 2.
In_expB2  Inverse Expected Document Frequency model with Bernoulli after-effect and Normalisation 2. The logarithms are base 2. This model can be used for classic ad-hoc tasks.
In_expC2  Inverse Expected Document Frequency model with Bernoulli after-effect and Normalisation 2. The logarithms are base e. This model can be used for classic ad-hoc tasks.
InL2      Inverse Document Frequency model with Laplace after-effect and Normalisation 2. This model can be used for tasks that require early precision.
PL2       Poisson model with Laplace after-effect and Normalisation 2. This model can be used for tasks that require early precision [7, 8].

Recommended settings for various collections are provided in the Example TREC Experiments documentation.

Another provided weighting model is a derivation of the BM25 formula from the divergence from randomness framework. Finally, Terrier also provides a generic DFR weighting model, which allows any DFR model to be generated and evaluated.

Query Expansion

The query expansion mechanism extracts the most informative terms from the top-returned documents as the expanded query terms. In this expansion process, terms in the top-returned documents are weighted using a particular DFR term weighting model. Currently, Terrier deploys the Bo1 (Bose-Einstein 1), Bo2 (Bose-Einstein 2) and KL (Kullback-Leibler) term weighting models. The DFR term weighting models follow a parameter-free approach by default.

An alternative approach is Rocchio's query expansion mechanism. A user can switch to the latter approach by setting parameter.free.expansion to false in the terrier.properties file. The default value of the parameter beta of Rocchio's approach is 0.4. To change this parameter, the user needs to specify the property rocchio_beta in the terrier.properties file.
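As an illustration of how a DFR term weighting model ranks candidate expansion terms, here is a sketch assuming the commonly cited parameter-free form of the Bo1 weight (this is not Terrier's actual code; tf_x is the frequency of the term in the top-returned documents, F its frequency in the whole collection, N the number of documents):

```python
from math import log2

def bo1_weight(tf_x, F, N):
    """Bo1 (Bose-Einstein 1) expansion-term weight in its commonly cited form:
    w(t) = tf_x * log2((1 + Pn) / Pn) + log2(1 + Pn), with Pn = F / N."""
    Pn = F / N
    return tf_x * log2((1.0 + Pn) / Pn) + log2(1.0 + Pn)

# Rank made-up candidate terms extracted from the top-returned documents:
# term -> (frequency in the top documents, frequency in the collection).
candidates = {"retrieval": (12, 400), "model": (9, 2500), "randomness": (7, 150)}
N = 10_000
for term, (tf_x, F) in sorted(candidates.items(),
                              key=lambda kv: -bo1_weight(kv[1][0], kv[1][1], N)):
    print(term, round(bo1_weight(tf_x, F, N), 2))
```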

Fields

DFR can encapsulate the importance of term occurrences in different fields in a variety of ways:

    1. Per-field normalisation: the frequencies from the different fields in the documents are normalised with respect to the length statistics typical for that field. This is performed by the PL2F weighting model; a small sketch of the idea follows this list. Other per-field normalisation models can be generated using the generic PerFieldNormWeightingModel model.
    2. Multinomial: the frequencies from the different fields are modelled in their divergence from the randomness expected by the term's occurrences in that field. The ML2 and MDL2 models implement this weighting.
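A minimal sketch of the per-field normalisation idea from option 1: each field's frequency is normalised against that field's own length statistics (Normalisation 2 per field) and the results are combined with per-field weights, in the spirit of PL2F. The weights and parameters here are illustrative assumptions, not Terrier's exact implementation.

```python
from math import log2

def per_field_tfn(field_stats, c, weights):
    """field_stats: {field: (tf_f, field_length, avg_field_length)}.
    Each field frequency is normalised with Normalisation 2 using that field's
    own length statistics, then combined with a per-field weight."""
    tfn = 0.0
    for field, (tf_f, fl, avg_fl) in field_stats.items():
        tfn += weights[field] * tf_f * log2(1.0 + c[field] * avg_fl / fl)
    return tfn

stats = {"title": (1, 8, 10), "body": (4, 900, 600)}   # made-up statistics
print(per_field_tfn(stats,
                    c={"title": 10.0, "body": 1.0},
                    weights={"title": 2.0, "body": 1.0}))
```
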
Proximity

Proximity can be handled within DFR by considering the number of occurrences of a pair of query terms within a window of pre-defined size. In particular, the DFRDependenceScoreModifier DSM implements the pBiL and pBiL2 models, which measure the randomness compared to the document's length, rather than the statistics of the pair in the corpus.

DFR Models and Cross-entropy

A different interpretation of the gain-risk generating Formula (4) can be given by the notion of cross-entropy. Shannon's mathematical theory of communication in the 1940s [5] established that the minimal average code word length is about the value of the entropy of the probabilities of the source words. This result is known under the name of the Noiseless Coding Theorem. The term noiseless refers to the assumption of the theorem that there is no possibility of errors in transmitting words. Nevertheless, it may happen that different sources about the same information are available. In general each source produces a different coding. In such cases, we can compare the sources of evidence using the cross-entropy. The cross-entropy is minimised when the two observations have the same probability density function, and in such a case the cross-entropy coincides with Shannon's entropy.

We have two tests of randomness: the first test is P_risk and is relative to the term distribution within its elite set, while the second, Prob_M, is relative to the document with respect to the entire collection. The first distribution can be treated as a new source of the term distribution, while the coding of the term with the distribution within the collection can be considered as the primary source. The definition of the cross-entropy relation of these two probability distributions is:

    CE = -P_risk · log2 Prob_M(t ∈ d | Collection)    (9)

Relation (9) is indeed Relation (4) of the DFR framework. DFR models can thus be equivalently defined as the divergence of two probabilities measuring the amount of randomness of two different sources of evidence.

For more details on the divergence from randomness framework, you may refer to the PhD thesis of Gianni Amati, or to Amati and Van Rijsbergen's paper "Probabilistic models of information retrieval based on measuring divergence from randomness", TOIS 20(4): 357-389, 2002.

[1] S. P. Harter. A probabilistic approach to automatic keyword indexing. PhD thesis, Graduate Library School, University of Chicago, Thesis No. T25146, 1974.
[2] J. Ponte and B. Croft. A language modeling approach to information retrieval. In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), B. Croft, A. Moffat, and C. J. van Rijsbergen, Eds., pp. 275-281.
[3] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, June 1994), Springer-Verlag, pp. 232-241.
[4] S. E. Robertson, C. J. van Rijsbergen and M. Porter. Probabilistic models of indexing and searching. In Information Retrieval Research, S. E. Robertson, C. J. van Rijsbergen and P. Williams, Eds., Butterworths, 1981, ch. 4, pp. 35-56.
[5] C. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Illinois, 1949.
[6] B. He and I. Ounis. A study of parameter tuning for term frequency normalization. In Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM 2003), New Orleans, LA, USA, 2003.
[7] B. He and I. Ounis. Term frequency normalisation tuning for BM25 and DFR models. In Proceedings of the 27th European Conference on Information Retrieval (ECIR '05), 2005.
[8] V. Plachouras and I. Ounis. Usefulness of hyperlink structure for Web information retrieval. In Proceedings of ACM SIGIR 2004.
[9] V. Plachouras, B. He and I. Ounis. University of Glasgow at TREC 2004: experiments in the Web, Robust and Terabyte tracks with Terrier. In Proceedings of the 13th Text REtrieval Conference (TREC 2004), 2004.
