Classic Retrieval Models
The information retrieval model has gone through several stages from its birth to the present: models based on set theory, on linear algebra, and on statistics and probability. Although expert retrieval differs from traditional document retrieval, the two are closely related, and this paper likewise takes retrieval over expert description documents as its baseline and as the foundation for the subsequent optimization. It is therefore necessary to understand the traditional retrieval models, and this section briefly introduces the classical models of the different stages.

2.1.1.1 Boolean Model
The Boolean model is a simple but elegant model based on set theory and Boolean algebra. It attracted much attention early on and was widely used in many early commercial search engines. In recent years it has gradually been replaced by the vector space model and probabilistic models, but it still has a place in information retrieval and serves as a good baseline.
The Boolean model is based on the following assumptions:
1. A document can be represented by a set of index terms.
2. A query can be expressed as a Boolean expression that connects keywords with the AND, OR and NOT logical operators.
3. If the index terms of a document satisfy the Boolean expression of the query, the document is relevant.
For example, for the user query "apple AND (jobs OR ipad4)", if an article contains "apple" and also contains at least one of "jobs" and "ipad4", then the article is relevant and satisfies the user's need.
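As a concrete illustration, a minimal Python sketch of this matching rule might look as follows; the document contents here are invented purely for illustration.

    # Minimal sketch of Boolean-model matching; the documents below are invented.
    # A document is represented as a set of index terms, and a query is a Boolean
    # predicate over membership of those terms.
    docs = {
        "d1": {"apple", "jobs", "iphone"},
        "d2": {"apple", "fruit", "pie"},
        "d3": {"ipad4", "apple", "review"},
    }

    # The query "apple AND (jobs OR ipad4)" expressed as a predicate on a term set.
    query = lambda terms: "apple" in terms and ("jobs" in terms or "ipad4" in terms)

    # A document either satisfies the expression or not; the result set is unordered
    # and carries no notion of how relevant each document is.
    relevant = [doc_id for doc_id, terms in docs.items() if query(terms)]
    print(relevant)  # ['d1', 'd3']: d2 lacks both 'jobs' and 'ipad4'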
The Boolean model is easy to understand and implement, and Boolean algebra provides it with a solid theoretical basis.

However, the Boolean model has a fatal drawback: relevance is strictly binary (0 or 1), that is, a document is either relevant or completely irrelevant. The model cannot express degrees of relevance, and the result set it returns is unordered, which is too coarse. For example, for the query

q = world AND cup AND south AND africa AND 2010, with index(d) = {world, cup, south, africa},

d will be considered irrelevant. Similarly, for

q = world OR cup OR south OR africa OR 2010, with index(d1) = {world, cup, south, africa, 2010} and index(d2) = {2010},

d1 and d2 will be considered equally relevant, although d1 is clearly more relevant. It is also unrealistic to expect ordinary users to construct proper Boolean query expressions themselves.

2.1.1.2 Vector Space Model (VSM)
In the 1970s, Salton, a founder of the information retrieval field, proposed the VSM. Compared with the strict binary relevance of the Boolean model, it supports partial-matching retrieval strategies. As a model for document representation and similarity computation, the VSM is widely used not only in search but also in text mining and natural language processing.
The VSM represents both the query and the document as sets of terms and maps them into a high-dimensional space in which each dimension corresponds to a word in the document collection; the similarity between the vectors representing the query and the document then expresses how relevant the document is to the query.
When mapping into the high-dimensional space, the VSM turns each word into a weight, and the most commonly used weighting function is TF*IDF, which takes into account how a word occurs both in the document and in the document collection. A basic TF*IDF formula is

w_i = tf_i(d) · log(N / df_i)    (2-1)

where N is the number of documents in the collection, tf_i(d) is the term frequency, i.e. the number of occurrences of term i in document d, and df_i is the document frequency, i.e. the number of documents in the collection that contain term i. According to the TF*IDF formula, the more frequently a word occurs in a document, the larger its weight, indicating that the word is more representative of the document; but the more documents in the collection contain the word, the smaller its weight, indicating that the word is less able to discriminate the document's content.
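A small Python sketch of this basic weighting under formula (2-1), with an invented toy corpus, might look as follows; the logarithm base only scales the weights.

    import math

    # Sketch of the basic TF*IDF weighting of formula (2-1): w_i = tf_i(d) * log(N / df_i).
    # The corpus below is invented for illustration.
    corpus = {
        "d1": ["apple", "iphone", "apple", "jobs"],
        "d2": ["apple", "pie", "recipe"],
        "d3": ["android", "phone", "review"],
    }
    N = len(corpus)

    def df(term):
        # Document frequency: how many documents contain the term.
        return sum(1 for words in corpus.values() if term in words)

    def tf_idf(term, doc_id):
        tf = corpus[doc_id].count(term)      # term frequency in the document
        return tf * math.log(N / df(term))   # formula (2-1), natural logarithm

    print(tf_idf("apple", "d1"))   # tf=2, df=2: weight = 2 * log(3/2), about 0.81
    print(tf_idf("iphone", "d1"))  # tf=1, df=1: weight = 1 * log(3),   about 1.10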
In the VSM, the general framework for computing feature weights is TF*IDF, but there are many variants of the concrete TF and IDF formulas.

One TF variant is wtf = 1 + log(tf), which is designed to suppress the side effect of very large term frequencies. If a word occurs 10 times in one document and once in another, formula (2-1) would make their TF components differ by a factor of 10, but in practice such a large gap is not needed, so the logarithm is used as a damping function; the added 1 smooths the case where the term frequency is 1.

Another TF variant is wtf = α + (1 − α) · tf / max(tf), which is designed to suppress long documents. Here tf is the actual frequency of the word in the document, max(tf) is the frequency of the most frequent word in that document, and α is a tuning factor; later studies suggest that α = 0.4 works well.

An IDF variant is idf = log(N / df + 1), where the symbols have the same meaning as in formula (2-1); the added 1 smooths the case where df equals N.
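As a quick numeric illustration of these variants (the numbers here are invented): with base-10 logarithms, raw frequencies tf = 100 and tf = 1 give wtf = 1 + log(100) = 3 and wtf = 1 + log(1) = 1, so a 100-fold gap in raw frequency shrinks to a 3-fold gap in weight. Under the second variant with α = 0.4, the most frequent word in a document receives wtf = 0.4 + 0.6 · 1 = 1.0, while a word occurring half as often receives 0.4 + 0.6 · 0.5 = 0.7. With the smoothed IDF, a word that occurs in every document receives log 2 rather than 0.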
According to the TF*IDF framework, if the TF and IDF of a word are each classified as high or low, the combined weight of the word falls into one of four quadrants, as shown in Table 2.1.
Table 2.1 Combined term weight by TF and IDF

|          | TF high  | TF low   |
| IDF high | High     | Moderate |
| IDF low  | Moderate | Low      |
The VSM assumes that the more similar two vectors are in the high-dimensional space, the more related the corresponding documents are in content. The similarity between a query and a document can be expressed by the cosine of the angle between their vectors in that space:

cos(d, q) = (d · q) / (|d| · |q|)    (2-2)

The more similar d and q are, the smaller the angle between their vectors and the larger cos(d, q); the larger the angle, the more the two vectors differ and the lower the similarity between d and q. However, formula (2-2) has an obvious flaw: it over-penalizes long documents. Suppose two documents assign the same weights to the words related to the query topic, but the long document also discusses other topics. The numerators of the cosine formula are then basically the same for both, but the vector length in the denominator is larger for the long document, so cos(d_long, q) ends up smaller than cos(d_short, q).
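A short sketch of this computation, including the long-document effect just described, might be as follows; the sparse term-weight vectors are invented.

    import math

    # Sketch of the cosine similarity of formula (2-2) over already-weighted term vectors,
    # represented as sparse term -> weight dictionaries (invented values).
    def cosine(d, q):
        common = set(d) & set(q)
        dot = sum(d[t] * q[t] for t in common)              # numerator: inner product
        norm_d = math.sqrt(sum(w * w for w in d.values()))  # |d|
        norm_q = math.sqrt(sum(w * w for w in q.values()))  # |q|
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    q = {"world": 1.0, "cup": 1.0}
    d_short = {"world": 2.0, "cup": 1.5}
    d_long = {"world": 2.0, "cup": 1.5, "economy": 3.0, "politics": 3.0}  # same query-topic weights plus other topics

    print(cosine(d_short, q))  # higher: the denominator reflects only the query topic
    print(cosine(d_long, q))   # lower: the extra topics enlarge |d| and suppress the score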
Besides the cosine, the following similarity measures are also commonly used, as shown in Table 2.2.
Table 2.2 Commonly used similarity measures (standard set-based forms, with X and Y the query and document term sets)

Method name             Calculation formula
Simple matching         |X ∩ Y|
Dice's coefficient      2|X ∩ Y| / (|X| + |Y|)
Jaccard's coefficient   |X ∩ Y| / |X ∪ Y|
Overlap coefficient     |X ∩ Y| / min(|X|, |Y|)
By computing the similarity between the query and each document, the documents can then be sorted by similarity and returned as the result. The VSM is also easy to implement, and unlike the Boolean model it returns a ranked list of related documents. However, when the vector dimensionality is very high the computation becomes time-consuming and resource usage grows considerably. The VSM also rests on the basic assumption that the term dimensions are independent of each other, whereas in reality terms are largely dependent; words in natural language, for example, are closely related to one another, and researchers later proposed n-gram language models to address this. In addition, the VSM is an empirical model built on experience and intuition, and its theoretical grounding is not fully sufficient.
2.1.1.3 Probabilistic Model
The probabilistic model was proposed by Robertson and Sparck Jones in 1976; it uses relevance feedback to refine the results step by step. Its basic idea is as follows: given a query q, the documents in the collection are divided into two classes, the set R relevant to q and the set R' irrelevant to q. Within a class, the index terms are assumed to follow the same or similar distribution, while across classes the distributions differ. By estimating the distribution of the index terms in a document, we can therefore judge the relevance of the document to the query, namely:

sim(d, q) = P(R = 1 | d) / P(R = 0 | d) = (P(d | R = 1) · P(R = 1)) / (P(d | R = 0) · P(R = 0)) ∝ P(d | R = 1) / P(d | R = 0)

where P(R = 1) and P(R = 0) depend only on the specific query q, so the ratio P(R = 1) / P(R = 0) is fixed; P(d | R = 1) is the probability of document d occurring in the set of documents relevant to q, and P(d | R = 0) is the probability of d occurring in the set of documents irrelevant to q.
The most widely used probabilistic-model formula to date is the BM25 formula proposed by Robertson:

score(d, q) = Σ_{t ∈ q} ((k1 + 1) · tf) / (K + tf) · ((k3 + 1) · qtf) / (k3 + qtf) · log((N − df_t + 0.5) / (df_t + 0.5)),
with K = k1 · ((1 − b) + b · len(d) / avg_len)

where qtf is the frequency of term t in the query, tf is its frequency in document d, K is the document-length normalization (len(d) is the length of d and avg_len the average document length in the collection), the last factor is a smoothed log(·) term that acts as the inverse document frequency of term t, and k1, b and k3 are empirical parameters.
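A sketch of BM25 scoring under this formula might look as follows; the corpus is invented, and the parameter values k1 = 1.2, b = 0.75 and k3 = 7 are common defaults assumed here rather than values prescribed by this text.

    import math

    # Sketch of BM25 scoring. Documents and queries are lists of terms; the corpus is invented.
    def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75, k3=7.0):
        N = len(corpus)
        avg_len = sum(len(d) for d in corpus.values()) / N
        score = 0.0
        for t in set(query_terms):
            df = sum(1 for d in corpus.values() if t in d)
            if df == 0:
                continue
            idf = math.log((N - df + 0.5) / (df + 0.5))        # smoothed inverse document frequency
            tf = doc_terms.count(t)                            # term frequency in the document
            qtf = query_terms.count(t)                         # term frequency in the query
            K = k1 * ((1 - b) + b * len(doc_terms) / avg_len)  # document-length normalization
            score += ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf)) * idf
        return score

    corpus = {
        "d1": ["world", "cup", "south", "africa", "football"],
        "d2": ["world", "economy", "report"],
        "d3": ["cooking", "recipes", "baking"],
        "d4": ["travel", "guide", "europe"],
    }
    print(bm25_score(["world", "cup"], corpus["d1"], corpus))  # d1 contains 'cup' and scores higher
    print(bm25_score(["world", "cup"], corpus["d2"], corpus))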
The probabilistic model has a rigorous mathematical foundation and strong theoretical grounding, and it is one of the most effective models, as evaluations such as TREC have confirmed. However, it depends heavily on the text collection, its formula requires parameter estimation, and it relies on a binary independence assumption.

2.1.1.4 Language Model
Before being applied to information retrieval, the language model had already been used successfully in speech recognition, machine translation and Chinese word segmentation; it has the advantages of high accuracy and being easy to train and maintain.
Language modeling approaches fall broadly into two categories: one relies entirely on statistical modeling over large-scale text data, and the other is the deterministic language model based on Chomsky's formal languages, which focuses more on grammatical analysis.
In terms of the basic idea, other retrieval models reason from the query to the documents, i.e. given a user query, they look for the relevant documents. The language model works in the opposite, reverse direction, from the document to the query: a separate language model is built for each document, the probability that the document generates the query is computed, and the documents are ranked by this probability to produce the final results.
When applied to IR, a language model is attached to each document, and when a query q is entered the documents are ranked by the query likelihood, i.e. the probability that the document's language model generates the query. If the terms are assumed to be independent of each other, using a unigram language model this is

P(q | d) = Π_{i=1..m} P(t_i | d)

where query q consists of the terms t_1, ..., t_m.
However, the language model faces a data sparseness problem: if a query term does not appear in the document, the whole generation probability becomes 0. Language models therefore introduce data smoothing, which flattens the term distribution so that every term receives a non-zero probability. Language-model retrieval usually introduces a background probability estimated from the whole document collection for smoothing. The background probability is a language model of the entire collection; because the collection is large, most query terms occur in it, which avoids zero probabilities. If the collection also uses a unigram model, the background probability of a word is its number of occurrences divided by the total number of word occurrences in the collection. With data smoothing, the generation probability of a query term is

P(t_i | d) = λ · tf(t_i, d) / len(d) + (1 − λ) · P(t_i | C)

where P(t_i | C) is the background language model of t_i, tf(t_i, d) / len(d) is the document language model of t_i, the whole probability is a linear interpolation of the two parts, and λ ∈ [0, 1] is a tuning factor.
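A sketch of query-likelihood ranking with this linear interpolation (often called Jelinek-Mercer smoothing) might look as follows; the corpus and the value λ = 0.7 are invented for illustration.

    import math
    from collections import Counter

    # Sketch of the query-likelihood unigram language model with linear interpolation smoothing.
    # The corpus is invented for illustration.
    corpus = {
        "d1": ["world", "cup", "south", "africa", "world"],
        "d2": ["world", "economy", "report"],
    }
    collection = [w for doc in corpus.values() for w in doc]
    coll_counts = Counter(collection)            # background (collection) statistics
    coll_len = len(collection)

    def log_query_likelihood(query, doc, lam=0.7):
        doc_counts = Counter(doc)
        doc_len = len(doc)
        score = 0.0
        for t in query:
            p_doc = doc_counts[t] / doc_len      # document model: tf / len(d)
            p_coll = coll_counts[t] / coll_len   # background model: P(t | C); terms absent
                                                 # from the whole collection would need extra handling
            p = lam * p_doc + (1 - lam) * p_coll # linear interpolation
            score += math.log(p)                 # sum of logs = log of the product
        return score

    print(log_query_likelihood(["world", "cup"], corpus["d1"]))
    print(log_query_likelihood(["world", "cup"], corpus["d2"]))  # 'cup' is smoothed, not zero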
Many variants of the language model for IR appeared later, some going beyond the query-likelihood model, such as models based on KL divergence. At present, after parameter tuning, language-model retrieval performs slightly better than the vector space model and is on a par with BM25 and other probabilistic models.
2.1.1.5 DFR Model

The DFR (Divergence from Randomness) model was proposed by Gianni Amati and Keith van Rijsbergen in 2002. It is based on statistics and information theory, is one of the more effective retrieval models today, and is a parameter-free statistical retrieval model.

In DFR, inf(tf | d) denotes the amount of information carried by each word in document d, and ranking the documents means placing the documents that carry more information towards the front. Here tfc is the frequency of the term in the whole document collection, l(d) is the length of document d, tf is the frequency of the term within a document of length l(d), and TFC is the total number of word occurrences in the document collection.

There are two types of DFR models: those built per document (Type I) and those built on the document collection (Type II). The Type I model DLH uses entropy to derive the weight of a term t, while the Type II model DPH uses a different derivation for the weight of t.
Retrieval Model Evaluation Metrics
The two most commonly used basic metrics in information retrieval are precision and recall, as shown in Figure 2.3.

Fig. 2.3 Schematic diagram of precision and recall

Precision and recall are defined as follows:

precision = number of relevant documents in the returned results / number of returned results    (2-8)

recall = number of relevant documents in the returned results / total number of relevant documents    (2-9)
A metric that combines precision and recall is the F-measure:

F_β = (1 + β²) · P · R / (β² · P + R)

where β < 1 places the emphasis on precision and β > 1 on recall. Usually β is taken as 1, treating precision and recall as equally important, in which case

F_1 = 2 · P · R / (P + R)
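A small sketch of these three metrics, with an invented result list and relevance judgments, might be:

    # Sketch of precision, recall and the F-measure from formulas (2-8) and (2-9);
    # the returned list and the relevant set below are invented.
    def precision_recall_f(returned, relevant, beta=1.0):
        hit = len(set(returned) & relevant)                  # relevant documents actually returned
        precision = hit / len(returned) if returned else 0.0
        recall = hit / len(relevant) if relevant else 0.0
        if precision == 0.0 and recall == 0.0:
            return precision, recall, 0.0
        b2 = beta * beta
        f = (1 + b2) * precision * recall / (b2 * precision + recall)
        return precision, recall, f

    returned = ["d1", "d2", "d3", "d4"]      # system output
    relevant = {"d1", "d3", "d5"}            # ground-truth relevant documents
    print(precision_recall_f(returned, relevant))       # P = 0.5, R about 0.67, F1 about 0.57
    print(precision_recall_f(returned, relevant, 0.5))  # beta < 1 weights precision more heavily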
The evaluation metrics commonly used in TREC include:
1. Average Precision (AP)

AP is the average of the precision values measured at the position of each relevant document:

AP(q) = (1 / R_q) · Σ_{i=1..R_q} i / doc_q(i)

where R_q is the number of documents relevant to query q, and doc_q(i) is the total number of documents retrieved when the i-th relevant document is retrieved, which can also be understood as the rank position of the i-th relevant document.
2. MAP (Mean Average Precision)

Building on AP, MAP is the mean of AP over a set of queries Q:

MAP = (1 / |Q|) · Σ_{q ∈ Q} AP(q)
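A sketch of AP and MAP as defined above, over invented rankings and relevance judgments:

    # Sketch of AP and MAP. Rankings and relevance judgments are invented;
    # relevant documents that are never retrieved contribute a precision of 0.
    def average_precision(ranking, relevant):
        hits = 0
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank   # precision at the rank of the i-th relevant document
        return precision_sum / len(relevant) if relevant else 0.0

    def mean_average_precision(runs):
        # runs: a list of (ranking, relevant_set) pairs, one per query
        return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2, about 0.83
        (["d5", "d6", "d7"], {"d6", "d8"}),        # AP = (1/2 + 0) / 2 = 0.25 ('d8' is never retrieved)
    ]
    print(mean_average_precision(runs))            # about 0.54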
3. P@N

P@N is the proportion of relevant documents among the top N ranked documents.
4. NDCG (Normalized Discounted Cumulative Gain)

A document is not merely relevant or irrelevant; it can be assigned one of several relevance grades, such as the five-level relevance mentioned earlier (0, 1, 2, 3, 4). The NDCG computation is more involved and proceeds in steps: CG (Cumulative Gain), DCG (Discounted Cumulative Gain), and finally NDCG.

DCG is an evaluation metric for graded relevance and includes a position discount factor:

DCG@k = Σ_{r=1..k} G(r) / log2(1 + r)

where R is the ranked result list for query q, r is the position of a document in R, and G(r) is the gain of the document at position r, usually set to G(r) = 2^{l(r)} − 1 with l(r) the relevance grade of the document at position r; the position discount factor is usually set to 1 / log2(1 + r).

NDCG is the normalization of DCG:

NDCG@k = DCG@k / Z_k

where Z_k is the DCG value of the ideal ranking, i.e. the documents sorted in descending order of relevance grade. An example is given in Table 2.3 below, and a sketch of the computation follows the table.
Table 2.3 NDCG calculation example

| Doc ID | Relevance grade | Gain | CG  | DCG                    | Max DCG | NDCG      |
| 1      | 5               | 31   | 31  | 31 = 31 * 1            | 31      | 1 = 31/31 |
| 2      | 2               | 3    | 34  | 32.9 = 31 + 3 * 0.63   | 40.5    | 0.81      |
| 3      | 4               | 15   | 49  | 40.4 = 32.9 + 15 * 0.5 | 48.0    | 0.84      |
| 4      | 4               | 15   | 64  | 46.9                   | 54.5    | 0.86      |
| 5      | 4               | 15   | 79  | 52.7                   | 60.4    | 0.87      |
| 6      | ...             | ...  | ... | ...                    | ...     | ...       |
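A sketch of the NDCG computation illustrated in the table, with gain 2^l(r) − 1 and discount 1 / log2(1 + r), might be:

    import math

    # Sketch of DCG and NDCG as used in the table above.
    def dcg(relevance_levels, k=None):
        levels = relevance_levels if k is None else relevance_levels[:k]
        return sum((2 ** level - 1) / math.log2(1 + rank)
                   for rank, level in enumerate(levels, start=1))

    def ndcg(relevance_levels, k=None):
        ideal = sorted(relevance_levels, reverse=True)  # ideal ordering: highest grades first
        z = dcg(ideal, k)                               # Z_k, the ideal (maximum) DCG
        return dcg(relevance_levels, k) / z if z else 0.0

    # Relevance grades of the ranked list in the example table: 5, 2, 4, 4, 4, ...
    ranked_levels = [5, 2, 4, 4, 4]
    print(round(ndcg(ranked_levels, k=2), 2))  # about 0.81, matching the second row of the table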