Search Engine Retrieval Models: Calculating the Relevance of Queries and Documents

Source: Internet
Author: User
Tags: idf

1. Overview of the Retrieval model

Ranking search results is one of the most important functions of a search engine; to a large extent it determines the engine's quality and user satisfaction. Actual result ranking depends on many factors, but the most important are the relevance between the user's query and the page content, and the link structure of the pages. Here we focus on the relevance between page content and the user's query.

Deciding whether a page's content is relevant to a user's query depends on the retrieval model the search engine uses. The retrieval model is the theoretical foundation of a search engine: it provides a mathematical model for quantifying relevance, and a framework and methods for computing the similarity between query terms and documents. In essence, it models the degree of relevance. [Figure: position of the retrieval model in the search engine's system architecture]

Of course, retrieval model theory makes an idealized implicit assumption: that the user's need has already been clearly and completely expressed by the query, so the retrieval model's task does not include modeling the user's need. In reality this is far from true: even for the same query, different users' needs may differ greatly, and the retrieval model can do nothing about that.

2. Classification of retrieval models

The university course on mathematical models (Kang Qiyuan, 3rd edition) left some impression: a mathematical model reduces a practical problem to a corresponding mathematical problem and, on that basis, uses mathematical concepts, methods, and theory to analyze it in depth, so as to characterize the practical problem qualitatively or quantitatively and provide exact data or reliable guidance for its solution.

Dividing retrieval models by the mathematical method they use:

1) Set-theoretic IR models: the Boolean model, the extended Boolean model, and the fuzzy-set Boolean model;
2) Algebraic IR models: the vector space model, the latent semantic indexing model, and the neural network model;
3) Probabilistic IR models: the probabilistic model, the language model, the regression model, the inference network model, and the belief network model.

In addition, there are statistics-based machine learning ranking algorithms.
This article mainly introduces the Boolean model, the vector space model, the probabilistic model, the language model, and machine learning ranking.

3. Boolean model

Boolean model:

This is the simplest information retrieval model: a simple retrieval model based on set theory and Boolean algebra.

Basic idea:

Documents and user queries are represented as the sets of terms they contain, and their similarity is judged by Boolean algebra operations.

Calculation of similarity:

The query's Boolean expression is matched against every document; a document that satisfies the expression scores 1, otherwise 0.

such as query words:

Apple AND (iPhone OR iPad2)

Document Collection:

D1: The iPhone 5 was released on September 13.

D2: Apple released its next iPhone on September 13.

D3: The iPad2 will go on sale in the US on March 11.

D4: The iPhone and iPad2 are beautifully designed and stylish.

D5: The post-80s generation likes the iPhone, but doesn't like Apple.

Matching the query expression against each document:

The result is that D2 and D5 meet the search criteria.
This resembles traditional database retrieval: exact match. The advanced search of some search engines often uses the Boolean model idea, for example Google's Advanced Search.
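The matching above can be sketched in a few lines of Python. This is a toy illustration: the documents are lower-cased, whitespace-tokenized simplifications of D1-D5, so "Apple" and the brand name in D5 both reduce to the single term "apple".

```python
# Boolean model sketch: documents and the query are term sets,
# and matching is a pure Boolean algebra expression over set membership.

docs = {
    "D1": "iphone 5 was released on september 13",
    "D2": "apple released its next iphone on september 13",
    "D3": "ipad2 will go on sale in the us on march 11",
    "D4": "iphone and ipad2 are beautifully designed and stylish",
    "D5": "the post-80s generation likes the iphone but not apple",
}

def matches(terms):
    # Query: apple AND (iphone OR ipad2)
    return "apple" in terms and ("iphone" in terms or "ipad2" in terms)

hits = [doc_id for doc_id, text in docs.items() if matches(set(text.split()))]
print(hits)  # → ['D2', 'D5']
```

Note that each document either matches or not; the model produces no ranking among the hits, which is exactly the weakness discussed below.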
Advantages:
The model is simple in form and easy to implement.

Disadvantages:

1) Exact matching may retrieve too many or too few documents. Because the Boolean model simply decides that a document is either relevant or irrelevant, its retrieval strategy rests on a binary criterion and cannot describe degrees of match to the query condition. For this reason the Boolean model is really a data retrieval model rather than an information retrieval model.

2) Although Boolean expressions have exact semantics, it is often difficult to translate a user's information need into one. It is now widely accepted that weighting index terms can greatly improve retrieval effectiveness, and the vector space model grew out of term-weighting methods.

4. Vector Space Model (VSM)

Proposed and advocated by Salton et al. at Cornell University in the last century; the prototype system was SMART.

Basic idea:

A document is regarded as a vector of t features, generally terms. Each feature is assigned a weight on some basis, and the t weighted features together represent the document's subject content.

Similarity calculation:

Document similarity can be defined by the cosine measure: compute the angle between the query vector and the document vector in the t-dimensional space; the smaller the angle, the more similar they are. For feature weights, the TF*IDF framework can be used. TF is the term frequency in the document. IDF, the inverse document frequency factor, depends on how many documents in the collection contain the term; it is a global factor that reflects not a property of the document itself but the relative importance of the feature term: the more documents a term appears in, the lower its IDF value and the weaker the term's ability to distinguish between documents. This framework generally takes weight = TF * IDF as the weight formula.

Ideas:

1) Vector representation: the vector of document Dj can be written as Dj = (w1j, w2j, ..., wnj), where n is the number of terms in the system and wij is the weight of term i in document Dj. The query Q can be written as Q = (w1q, w2q, ..., wnq), where wiq is the weight of term i in query Q.

2) Document-term matrix (doc-term matrix): n documents and m index terms form an m*n matrix A. Each column can be viewed as the vector representation of a document, and each row as the vector representation of a term.

3) Weight calculation:
Boolean weight: the weight of term i in document j is wij = 1 if the term appears, otherwise 0.
TF weight: tf (term frequency) is the number of times the term appears in the document; wij = tfij, or a normalized tf value. TF normalization maps the tf values of all cited terms in a document into [0, 1], usually with one of the following:
(1) wtf = 1 + log(tf)
(2) wtf = a + (1 - a) * tf / max(tf), where a is a smoothing factor with empirical value a = 0.5 (recent research suggests 0.4 works better).
Document frequency df (doc frequency): the number of documents in the whole collection in which the term appears. DF reflects how discriminative a term is: the higher the df, the more common the term and the lower its weight should be. Inverse document frequency (idf): usually computed as idf_i = log(N / df_i), where N is the total number of documents in the collection.
4) Computing weights: the vector space model usually computes weights by the tf*idf method, i.e., the weight of term i in document Dj is wij = tfij * idfi.
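The tf*idf weighting described above can be sketched as follows. The three-document corpus is invented for illustration, and the plain idf = log(N/df) variant is used.

```python
import math

# TF-IDF sketch: w_ij = tf_ij * idf_i, with idf_i = log(N / df_i).

docs = [
    ["apple", "iphone", "apple"],   # toy document 0
    ["iphone", "ipad2"],            # toy document 1
    ["world", "cup"],               # toy document 2
]
N = len(docs)  # total number of documents in the collection

def tf_idf(term, doc):
    tf = doc.count(term)                       # term frequency in this document
    df = sum(1 for d in docs if term in d)     # document frequency in the collection
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(N / df)                     # rarer terms get larger idf
    return tf * idf

print(round(tf_idf("apple", docs[0]), 3))   # tf=2, df=1, idf=log(3) → 2.197
print(round(tf_idf("iphone", docs[0]), 3))  # tf=1, df=2, idf=log(1.5) → 0.405
```

Note how "apple", which occurs in only one document, gets a much larger weight than the more common "iphone", matching the discriminative-power argument above.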
5) Similarity calculation: the relevance (i.e., similarity) between a document and a query term can be determined by the relative positions of their vectors in the vector space. Many similarity functions exist; the cosine of the angle between the two vectors is the most commonly used.

Definition via the dot product: the dot product of two vectors (also called the "inner product" or "scalar product" in physics) is a scalar, written a·b. If a and b are not collinear, then a·b = |a| · |b| · cos⟨a, b⟩.

Its meaning: the dot product of two vectors equals the product of one vector's length with the projection of the other vector onto its direction; |b|·cosθ is called the projection of vector b onto the direction of vector a.

For two vectors a and b: a·b = |a| * |b| * cosθ, where |a| and |b| are the moduli of the two vectors and θ is the angle between them (0 ≤ θ ≤ π).

In coordinates, with a = (x1, y1, z1) and b = (x2, y2, z2):
a·b = x1x2 + y1y2 + z1z2; |a| = sqrt(x1² + y1² + z1²); |b| = sqrt(x2² + y2² + z2²).

From the definition: cos⟨a, b⟩ = a·b / (|a| · |b|); if a and b are collinear, then a·b = ±|a|·|b|.

Properties:
1) a·a = |a|².
2) a ⊥ b ⟺ a·b = 0.
The similarity of a document and a query is then given by:

sim(Q, Dj) = cosθ = (Σi wiq · wij) / (sqrt(Σi wiq²) · sqrt(Σi wij²))

To understand cosine similarity, view each document and the query as a point in the t-dimensional feature space: each feature forms one dimension, and connecting the origin of the feature space to the point gives a vector. Cosine similarity computes the angle between the two vectors in this space; the smaller the angle, the more similar the content of the two feature vectors. In the extreme case of two identical documents, the two vectors coincide and the cosine similarity is 1. Example: query Q (<2006:1>, <World Cup:2>); document D1 (<2006:1>, <World Cup:3>, <Germany:1>, <Held:1>); document D2 (<2002:1>, <World Cup:2>, <Korea:1>, <Japan:1>, <Held:1>).
[Figure: inverted index lists for the example terms]

Similarity of the query and document vectors, using the inner product: D1·Q = 1*1 + 3*2 = 7; D2·Q = 2*2 = 4. Angle cosines: cos(D1, Q) = 7 / (√5 · √12) ≈ 0.904; cos(D2, Q) = 4 / (√5 · √8) ≈ 0.632. D1 therefore ranks above D2.
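The worked example can be reproduced in a few lines, representing each vector as a sparse term-to-weight dict (term names are simplified to single tokens):

```python
import math

# Cosine similarity over sparse term-weight vectors.
def cosine(q, d):
    dot = sum(w * d.get(t, 0) for t, w in q.items())    # inner product
    nq = math.sqrt(sum(w * w for w in q.values()))      # |q|
    nd = math.sqrt(sum(w * w for w in d.values()))      # |d|
    return dot / (nq * nd)

Q  = {"2006": 1, "worldcup": 2}
D1 = {"2006": 1, "worldcup": 3, "germany": 1, "held": 1}
D2 = {"2002": 1, "worldcup": 2, "korea": 1, "japan": 1, "held": 1}

print(round(cosine(Q, D1), 3))  # dot = 7, 7/(√5·√12) → 0.904
print(round(cosine(Q, D2), 3))  # dot = 4, 4/(√5·√8)  → 0.632
```

D1 scores higher because it shares the rare term "2006" with the query and mentions "worldcup" more often.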

Advantages:
1) Simple and intuitive, and applicable to many other fields (text classification, bioinformatics, email filtering systems such as SpamAssassin);
2) Supports partial and approximate matching, and results can be ranked;
3) Retrieval results are quite good.

Disadvantages:
1) The amount of computation is large;
2) A term appearing in different positions should carry different weights, and keyword length affects the size of the weights;
3) The assumption that terms are mutually independent does not match reality: in fact term occurrences are related, not completely independent. For example, the occurrences of "Wang Liqin" and "ping pong" are not independent.

5. Probabilistic model

Probabilistic models:

This is one of the most effective retrieval models. Okapi BM25, a classical probabilistic scoring formula, is widely used in search engine page ranking. The probabilistic retrieval model is derived from the probability ranking principle.

Basic assumptions and theories:
1) Document independence: the relevance of a document to a query is independent of the other documents in the collection.
2) Term independence: terms in documents and query terms are mutually independent. That is, there is no association between the terms appearing in a document, and the distribution probability of any term in a document does not depend on other terms.
3) Document relevance is binary: a document is either relevant or irrelevant.
4) Probability ranking principle: the retrieval system should rank documents by the probability of their relevance to the query, so that the documents most likely to be relevant are retrieved first.
5) Bayes' theorem, expressed as a formula:
P(R|d) = P(d|R) · P(R) / P(d)
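The theorem can be checked with a tiny numeric example; the probability values below are invented purely for illustration.

```python
# Bayes' theorem as used by the probabilistic model:
#   P(R|d) = P(d|R) * P(R) / P(d)
# All three inputs are made-up illustrative numbers.

p_d_given_r = 0.4   # probability of observing document d given relevance
p_r = 0.1           # prior probability that a document is relevant
p_d = 0.08          # overall probability of document d

p_r_given_d = p_d_given_r * p_r / p_d
print(round(p_r_given_d, 3))  # → 0.5
```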

The basic idea is:

The idea is to connect queries and documents through probability: given a user query, if the retrieval system can rank the search results by descending probability of relevance to the user's need, then the system's accuracy is optimal. The core task is to estimate this relevance as accurately as possible from the document collection.

Calculation of similarity:
The query Q and document D are represented as binary vectors according to which terms they contain: q = {q1, q2, ...}, d = {d1, d2, ...}, where di = 1 or 0 indicates whether term i occurs in the document. R denotes that the document is relevant, and R̄ that it is irrelevant.

The conditional probability P(R|d) is the probability that document d is relevant to the query; P(R̄|d) is the probability that it is irrelevant.

Their ratio is used as the similarity between the document and the query. If P(R|d) > P(R̄|d), i.e., the ratio is greater than 1, the document's relevance outweighs its irrelevance and d is judged relevant; otherwise d is judged irrelevant. When the two are equal, it is by convention treated as irrelevant.
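Since the section singles out Okapi BM25, here is a minimal sketch of the standard BM25 scoring formula, with the common default parameters k1 = 1.2 and b = 0.75; the toy corpus and query are invented.

```python
import math

# Okapi BM25 sketch. For each query term t:
#   score += idf(t) * tf(t,d) * (k1 + 1) / (tf(t,d) + k1 * (1 - b + b * |d| / avgdl))
def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    score = 0.0
    for t in query:
        df = sum(1 for d in docs if t in d)        # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed, non-negative idf
        tf = doc.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [
    ["apple", "iphone", "apple"],
    ["iphone", "ipad2"],
    ["world", "cup", "held"],
]
query = ["apple", "iphone"]
ranked = sorted(range(len(docs)), key=lambda i: bm25_score(query, docs[i], docs),
                reverse=True)
print(ranked)  # → [0, 1, 2]: document 0 contains both query terms and ranks first
```

Unlike the binary Boolean score, BM25 produces graded scores, with term-frequency saturation controlled by k1 and length normalization controlled by b.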

Advantages:
1. Built on rigorous mathematical theory, providing a mathematical basis for retrieval decisions (used, for example, by PubMed's related-articles feature);
2. Uses the principle of relevance feedback;
3. Avoids the Boolean logic that users find difficult;
4. Can exploit dependencies and relationships between terms in the course of computation.
Disadvantages:
1. High computational complexity, unsuitable for very large collections;
2. Parameter estimation is difficult;
3. Conditional probability values are hard to estimate;
4. Retrieval performance on its own is unremarkable; it needs to be combined with other retrieval models.

6. Language model

This model results from merging language modeling techniques adopted from the speech recognition field with information retrieval.
Basic idea:
Other retrieval models reason from the query to the documents: given a user query, find the relevant documents. The language model approach reasons in the opposite direction, from document to query: build a separate language model for each document, compute the probability that the document's model generates the user's query, and rank documents by this generation probability from high to low as the search results. A language model represents the distribution of terms or term sequences in a document.
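A minimal query-likelihood sketch of this document-to-query direction, assuming unigram document models with Jelinek-Mercer smoothing against the collection model; the two-document corpus and the smoothing weight lam are illustrative.

```python
import math

# Rank documents by log P(query | document language model).
# Jelinek-Mercer smoothing mixes the document model with the collection
# model so that unseen terms do not zero out the whole query probability.
def query_log_likelihood(query, doc, docs, lam=0.5):
    collection = [t for d in docs for t in d]   # concatenation of all documents
    logp = 0.0
    for t in query:
        p_doc = doc.count(t) / len(doc)                   # document model
        p_col = collection.count(t) / len(collection)     # collection model
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:
            return float("-inf")  # term unseen anywhere in the collection
        logp += math.log(p)
    return logp

docs = [
    ["world", "cup", "2006", "germany", "held"],
    ["world", "cup", "2002", "korea", "japan", "held"],
]
query = ["2006", "world", "cup"]
scores = [query_log_likelihood(query, d, docs) for d in docs]
print(scores.index(max(scores)))  # → 0: only document 0 contains "2006"
```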

7. Machine learning ranking

As search engines have developed, more and more factors must be considered when ranking a page, which can no longer be done by human experience alone; machine learning is very well suited here. For example, Google's current page ranking formula reportedly considers more than 200 factors. The data machine learning requires is readily available in a search engine, such as users' search-and-click history. The approach consists of four steps: manual annotation of training data, document feature extraction, learning the scoring function, and applying the learned model in the real search system. Instead of manual annotation, user click records can be used to simulate relevance scores for documents. http://blog.csdn.net/hguisu/article/details/7981145
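The pointwise flavor of learning to rank can be sketched with a hand-rolled logistic regression over document features; both the feature values and the relevance labels below are hypothetical, standing in for annotated or click-derived training data.

```python
import math

# Pointwise learning-to-rank sketch: learn a scoring function from labeled
# (feature vector, relevance) pairs, then rank documents by learned score.
# Each row: [content-similarity feature, click-history feature]; label 1 = relevant.
X = [[0.9, 0.8], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3]]
y = [1, 1, 0, 0]

w = [0.0, 0.0]
bias = 0.0
lr = 0.5

def score(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + bias

# Stochastic gradient descent on the logistic (log) loss.
for _ in range(1000):
    for x, label in zip(X, y):
        p = 1 / (1 + math.exp(-score(x)))   # predicted relevance probability
        err = p - label
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        bias -= lr * err

ranked = sorted(range(len(X)), key=lambda i: score(X[i]), reverse=True)
print(ranked[:2])  # the two relevant documents come first
```

Production systems use pairwise or listwise objectives and far richer features, but the pipeline (labels, features, learned scoring function) is the same.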

