Overview of information retrieval
Information retrieval is a widely used technology; literature search and web search engines both fall under it. The problem is usually abstracted as follows: given a document set D and a query string q composed of keywords w[1] ... w[k], return a list D' of relevant documents sorted by relevance(q, d), the degree to which document d matches query q.
Several classical information retrieval models address this problem, such as the Boolean model and the vector model, each approaching it from a different angle. The Boolean model is based on Boolean operations on sets. Queries are efficient, but the model is too simple to rank documents effectively, so result quality is poor. The vector model treats both the document and the query string as multidimensional vectors of words, and the relevance of a document to the query corresponds to the angle between the two vectors. However, because the vocabulary is large, the vectors have very high dimension and most components are 0, so the angle between vectors does not work well as a measure. In addition, the heavy computation makes the vector model almost infeasible on data sets as massive as those of Internet search engines.
At present, the TF-IDF model is widely used in real applications such as search engines. The main idea of the TF-IDF model is that if a word w appears frequently in a document d and rarely in other documents, then w discriminates well and is suitable for separating d from other documents. The model consists of two main factors:
1) The term frequency (TF) of word w in document d, which is the ratio of count(w, d), the number of times w appears in d, to size(d), the total number of words in d:
    tf(w, d) = count(w, d) / size(d)
2) The inverse document frequency (IDF) of word w over the whole document set D, which is the logarithm of the ratio of the total number of documents N to docs(w, D), the number of documents in which w appears:
    idf(w) = log(N / docs(w, D))
For each document d and a query string q composed of keywords w[1] ... w[k], the TF-IDF model combines TF and IDF into a weight that represents how well q matches d:
    tf-idf(q, d) = sum over i = 1..k of tf(w[i], d) * idf(w[i])
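The TF, IDF and summed TF-IDF weight described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are already tokenized into lists of words, the logarithm is taken base 2, a word that appears in no document gets an IDF of 0, and the three-document corpus is made up for the example.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """Term frequency: count(w, d) / size(d)."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[word] / len(doc_tokens)

def idf(word, corpus):
    """Inverse document frequency: log2(N / docs(w, D)).

    Returning 0.0 for a word found in no document is an assumption
    made here to avoid dividing by zero.
    """
    containing = sum(1 for doc in corpus if word in doc)
    if containing == 0:
        return 0.0
    return math.log2(len(corpus) / containing)

def tf_idf_score(query_tokens, doc_tokens, corpus):
    """Sum of tf * idf over the query keywords."""
    return sum(tf(w, doc_tokens) * idf(w, corpus) for w in query_tokens)

# Hypothetical toy corpus, one token list per document.
corpus = [
    ["apple", "iphone", "launch"],
    ["apple", "pie", "recipe"],
    ["we", "like", "apple", "pie"],
]
query = ["iphone", "launch"]

# Document indices sorted by descending TF-IDF weight for the query.
ranked = sorted(
    range(len(corpus)),
    key=lambda i: tf_idf_score(query, corpus[i], corpus),
    reverse=True,
)
```

In this toy corpus "apple" appears in every document, so its IDF is 0 and it contributes nothing to any score, while "iphone" and "launch" single out the first document.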
A probabilistic view of the information retrieval problem
Intuitively, TF describes how often a word appears in a document, while IDF is a weight related to the number of documents in which the word appears. The basic idea of TF-IDF is easy to understand qualitatively, but some of its details are not so easy to explain. For example:
1) Why is TF count(w, d) / size(d)? Could it be log(count(w, d) / size(d)) or some other form?
2) Why does IDF take a log form?
3) Why are TF and IDF multiplied, rather than added or combined exponentially?
4) Why are the TF-IDF values of multiple keywords added, rather than multiplied or combined exponentially?
5) In addition to the TF-IDF value, Google computes the PageRank of a page and multiplies the two to get the final weight. Why multiplication, rather than addition or exponentiation?
It is said that at first even the people who proposed TF-IDF had no strong explanation for why IDF takes a log form. A convincing explanation of the log form was later given from the perspective of information theory, but the other doubts remain. As far as I know, there is no truly unified and complete theoretical explanation of the TF-IDF model. While trying to find a better one, I realized that the root of all these doubts is that the "degree of match between query q and document d" is itself a fuzzy notion: what counts as "degree of match" leaves a lot of room for interpretation. The vector model's use of the angle between vectors to represent the degree of match at least has some theoretical basis, whereas using TF-IDF to represent it feels more like art than science.
Going further, the usual abstraction of the information retrieval problem, "given a document set D and a query string q, return a list D' of relevant documents sorted by relevance(q, d)", is itself worth reconsidering. We should consider abandoning the vague goal of "degree of match" and look for a goal with a definite mathematical meaning. If we take a probabilistic point of view and turn the question "how well does query string q match document d" into "when the query string is q, what is the probability that the user expects to get document d", the information retrieval problem becomes much clearer. On the one hand, this probabilistic description views the problem from the user's perspective and is closer to the actual user experience. On the other hand, probability has a definite mathematical meaning, so the goal of the problem is rigorous from the start.
Below, I use a model to explain the probabilistic meaning of TF-IDF and, at the same time, point out where it is unreasonable.
Box-ball model
To analyze the question "when the query string is q, what is the probability that the user expects to get document d", I first set up a simplified model called the box-ball model. The box-ball model pictures words as balls of different colors and documents as boxes containing a number of balls. The question "when the query string is q, what is the probability that the user expects to get document d" then becomes the following:
There are n boxes d[1], d[2], ... d[n], each containing a number of balls of different colors. Someone picks a box at random and draws from it a ball of color w[j]. What is the probability that the ball came from box d[i]?
This is the classical conditional probability problem P(d[i] | w[j]), which Bayes' theorem converts into:
    P(d[i] | w[j]) = P(w[j] | d[i]) * P(d[i]) / P(w[j])
This conditional probability consists of several parts: P(d[i]) is the prior probability that box d[i] is selected, P(w[j]) is the prior probability that a ball of color w[j] is drawn, and P(w[j] | d[i]) is the conditional probability of drawing a ball of color w[j] from box d[i].
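A small numerical instance of the box-ball question may help. The sketch below applies Bayes' theorem to three hypothetical boxes; the box names, ball counts and the default uniform prior are all made up for illustration, and P(w[j]) is computed from the boxes themselves by the law of total probability.

```python
from fractions import Fraction

# Hypothetical boxes (documents) and their ball counts by color (word).
boxes = {
    "d1": {"red": 3, "blue": 1},
    "d2": {"red": 1, "blue": 3},
    "d3": {"blue": 4},
}

def posterior(boxes, color, prior=None):
    """Return P(d | w) for every box d, given that a ball of `color` was drawn."""
    names = list(boxes)
    if prior is None:
        # Assumption: with no other knowledge, every box is equally likely.
        prior = {name: Fraction(1, len(names)) for name in names}
    # P(w | d): share of balls of this color inside each box.
    likelihood = {
        name: Fraction(boxes[name].get(color, 0), sum(boxes[name].values()))
        for name in names
    }
    # P(w): total probability of drawing this color over all boxes.
    evidence = sum(likelihood[name] * prior[name] for name in names)
    if evidence == 0:
        raise ValueError("no box contains a ball of this color")
    return {name: likelihood[name] * prior[name] / evidence for name in names}

post = posterior(boxes, "red")
```

With a uniform prior, a red ball most likely came from d1, less likely from d2, and cannot have come from d3, which holds no red balls.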
Document prior probability P(d) and PageRank
First, consider the prior probability P(d[i]) that box d[i] is selected. P(d[i]) is the probability that the user is interested in document d[i] before entering any query. In the absence of more information, we can assume that every box is equally likely to be selected, so P(d[i]) = 1/n, where n is the total number of documents (boxes); P(d[i]) can then be ignored as a common factor. In practice, however, we can usually estimate the prior probability of each document from other knowledge. For academic papers and web pages, for example, the prior can often be computed from a citation model, since frequently cited documents are the ones most people want to see. At this point you may have noticed that Google's PageRank is essentially the prior probability P(d[i]) multiplied by some factor. PageRank is therefore already part of this conditional probability model, which easily explains why the PageRank weight and the TF-IDF weight are multiplied in Google's ranking algorithm rather than added or combined exponentially. Moreover, once we understand how the document prior affects the probability of the whole search result, we need not cling to link-based PageRank when it is polluted by fake-link SEO. As long as the goal is to estimate the prior probability of a web page, a link-based PageRank, a click-through model based on search results, or any other model will do. Once the principle is understood, the method can be changed flexibly.
Prior probability of a word P(w)
Next, consider the prior probability P(w[j]) of the word w[j]. P(w[j]) is the probability that w[j] is used as a search keyword over the whole document collection. For example, words such as "iPhone 5" and "blue and white porcelain" are relatively likely to be used as search keywords, while high-frequency words such as "the", "what", and "we" are unlikely to be. How do we compute P(w[j]) quantitatively? One idea is to use the frequency of w[j] in the document set as its prior probability. There is a better approach, however: estimate P(w[j]) statistically from a large number of search queries. This is closer to what P(w[j]) actually means and requires no additional assumptions. For example, if a search engine handles 10^10 searches over some period and the term "Provident Fund" appears in 100 of them, we can take the prior probability P("Provident Fund") to be 100 / 10^10 = 10^-8.
Conditional probability P(w | d) of a word representing the document topic
Finally, consider the conditional probability P(w[j] | d[i]). It is the probability that people use the keyword w[j] when searching for document d[i]. What words do people use to search for a document? In most cases, the words that represent its topic. For example, for a news article about the iPhone 5 launch, the words "iPhone 5", "press conference", "Cook", and "Apple" basically make up its topic. Conversely, a user who wants to find this article is very likely to search with these words. Note the difference between P(w[j] | d[i]) and P(w[j]): the latter can be obtained from statistics over a large number of queries, but the former cannot simply be equated with it, because the former means the probability that w[j] represents the topic of d[i]. If we want a statistical method, P(w[j] | d[i]) corresponds to this statistic: when the search keyword is w[j] and the search results include d[i], how often does the user click on (is satisfied with) d[i]? For example, if searches for "iPhone 5 launch" return page x 10,000 times and users click on page x 8,000 of those times, we can say that with probability 80% the topic of page x is the "iPhone 5 launch".
The information content of a word and IDF
The above is a statistical method for computing P(w[j] | d[i]), but it has limitations. For example, the statistics require the document to have appeared in enough search results, which takes time and volume to accumulate. Besides statistics, we can consider other ways to compute the probability that word w[j] represents the topic of document d[i]. Some people may immediately think of semantic analysis: extract keywords from the article, give them high weight, and give other words low weight. The idea is reasonable, but it depends on semantic analysis, for which no mature and effective method exists. In fact, information theory offers another effective solution. High-frequency words such as "what" and "we" do not become document topics or search keywords because they carry too little information, whereas words such as "iPhone 5" and "press conference" are informative. Information here means the reduction in uncertainty (entropy), measured in bits: the more information, the greater the reduction in uncertainty. For example, it may or may not be raining outside, so there are 2 possibilities. If we look out of the window, only 1 possibility remains. The information provided by "seeing the rain through the window" equals the reduction in entropy, namely log(2/1) = 1 bit (logarithms here are base 2). Encoded in binary, the weather takes 1 bit: 0 for no rain and 1 for rain.
In many scenarios, however, the outcomes are not equally likely. For example, any of the 16 teams in the European Championship may win, but their prior probabilities of winning before the tournament differ, so the uncertainty of the result is actually less than log(16) = 4 bits. If you did not watch the tournament and someone tells you that Spain won, you will probably find it normal, but if someone tells you that Switzerland won, you will usually be very surprised. The theoretical explanation is this: if the probability of Spain winning before the tournament is 1/4 and that of Switzerland is 1/32, then "Spain wins" carries log(4) = 2 bits of information, that is, the uncertainty is reduced to 1/4 of the original, while "Switzerland wins" carries log(32) = 5 bits, reducing the uncertainty to 1/32 of the original. Suddenly receiving more than twice as much information, you are of course surprised.
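The tournament figures can be checked numerically. In the sketch below the 1/4 and 1/32 probabilities come from the text, while the full 16-team distribution is a hypothetical one chosen only so that it sums to 1; it is meant to illustrate that unequal probabilities give less than 4 bits of uncertainty.

```python
import math

def surprisal_bits(p):
    """Information, in bits, gained on learning that an outcome of probability p occurred."""
    return math.log2(1 / p)

def entropy_bits(probs):
    """Expected surprisal (Shannon entropy) of a distribution, in bits."""
    return sum(p * surprisal_bits(p) for p in probs if p > 0)

# 16 equally likely winners.
uniform = [1 / 16] * 16

# Hypothetical unequal distribution over 16 teams (sums to 1):
# 2 favorites at 1/4, 2 teams at 1/8, 4 teams at 1/32, 8 teams at 1/64.
skewed = [1 / 4] * 2 + [1 / 8] * 2 + [1 / 32] * 4 + [1 / 64] * 8

spain_bits = surprisal_bits(1 / 4)         # 2 bits
switzerland_bits = surprisal_bits(1 / 32)  # 5 bits
```

Here entropy_bits(uniform) is 4 bits, while the skewed distribution comes out lower, matching the claim that the real uncertainty is below log(16).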
Back to information retrieval. The query string "2012 United States general election", for example, contains three keywords: "2012", "United States", and "general election". How do we quantify their information content? By definition, the information content of a word equals the amount by which it reduces uncertainty. If the total number of documents is 2^30 and "United States" appears in 2^14 of them, then the word "United States" reduces the number of candidate documents from 2^30 to 2^14, so it carries log(2^30 / 2^14) = 16 bits of information. If "general election" appears in only 2^10 documents, it carries log(2^30 / 2^10) = 20 bits, 4 bits more than "United States". High-frequency words such as "what" and "we" do little to reduce document uncertainty, so their information content is close to 0. You have probably noticed that log(N / docs(w, D)) in the IDF formula above is exactly the information content of the word w.
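The arithmetic in this example is easy to reproduce. The function name below is an illustrative choice, and the last call assumes a word that appears in every document, which is the limiting case of a high-frequency word.

```python
import math

def information_bits(total_docs, docs_with_word):
    """Information content of a word in bits: log2(N / docs(w, D)), i.e. its IDF."""
    return math.log2(total_docs / docs_with_word)

N = 2 ** 30

united_states = information_bits(N, 2 ** 14)     # 16 bits
general_election = information_bits(N, 2 ** 10)  # 20 bits
everywhere = information_bits(N, N)              # 0 bits: the word rules out no document
```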
Now consider the effect of a word's information content on the conditional probability P(w[j] | d[i]). Suppose the probability that word w is selected from a document is proportional to the product of its frequency in the document and its information content. Then P(w[j] | d[i]) = tf-idf(w[j], d[i]) / tf-idf(d[i]), where tf-idf(w, d) = tf(w, d) * idf(w) and tf-idf(d[i]) is the sum of tf-idf(w, d[i]) over all words w in d[i], and the conditional probability model above becomes:
    P(d[i] | w[j]) = P(d[i]) * tf-idf(w[j], d[i]) / (P(w[j]) * tf-idf(d[i]))
TF-IDF now appears within the framework, but the model contains additional terms: the document prior probability P(d[i]), the keyword prior probability P(w[j]), and the total TF-IDF of the document's words, tf-idf(d[i]). General search engines rank by PageRank and TF-IDF, so according to this probabilistic model they leave out the document's total TF-IDF, tf-idf(d[i]), and the keyword prior probability P(w[j]). Taking these two factors into account may well improve search quality.
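To make the extra factors concrete, here is a minimal sketch of a single-keyword ranking score of the form P(d) * tf-idf(w, d) / (P(w) * tf-idf(d)). Everything in it is an assumption for illustration: the toy corpus, the base-2 logarithm, the document priors (standing in for something like PageRank) and the keyword prior (standing in for a query-log estimate). Because these priors are invented rather than estimated consistently, the values should be read as relative ranking scores, not calibrated probabilities.

```python
import math
from collections import Counter

def idf(word, corpus):
    """Information content of a word in bits: log2(N / docs(w, D))."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log2(len(corpus) / containing) if containing else 0.0

def tf_idf(word, doc, corpus):
    """tf(w, d) * idf(w) for one word in one tokenized document."""
    return Counter(doc)[word] / len(doc) * idf(word, corpus)

def total_tf_idf(doc, corpus):
    """tf-idf(d): sum of tf-idf over the distinct words of the document."""
    return sum(tf_idf(w, doc, corpus) for w in set(doc))

def ranking_score(word, doc, corpus, doc_prior, word_prior):
    """P(d) * tf-idf(w, d) / (P(w) * tf-idf(d)); 0.0 if no word in d is informative."""
    total = total_tf_idf(doc, corpus)
    if total == 0:
        return 0.0
    return doc_prior * tf_idf(word, doc, corpus) / (word_prior * total)

# Hypothetical toy corpus.
corpus = [
    ["iphone", "launch", "apple", "we"],
    ["apple", "pie", "we", "we"],
    ["we", "like", "pie", "recipe"],
]
word_prior = 1e-3  # hypothetical P("apple"), e.g. taken from a query log

# Equal document priors versus priors that strongly favor the first document.
uniform = [ranking_score("apple", d, corpus, 1 / 3, word_prior) for d in corpus]
skewed = [
    ranking_score("apple", d, corpus, p, word_prior)
    for d, p in zip(corpus, [0.8, 0.1, 0.1])
]
```

In this corpus plain TF-IDF gives the first two documents the same weight for "apple". Dividing by the document's total TF-IDF favors the second document, where "apple" makes up a larger share of the informative words, and a strong document prior can reverse that order again.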
The conditional probability model above mainly covers the case of a single keyword. Below we extend it to multiple keywords. In TF-IDF, the TF-IDF values of multiple keywords are added together. Does this fit the conditional probability model? The answer is no. With two keywords, the conditional probability problem becomes: "if someone draws a ball of color w[x] and a ball of color w[y] from a box, what is the probability that both balls came from box d[i]?" Assuming that the draws from a box are independent of each other,
    P(w[x], w[y] | d[i]) = P(w[x] | d[i]) * P(w[y] | d[i])
we can derive the conditional probability:
    P(d[i] | w[x], w[y]) = P(w[x] | d[i]) * P(w[y] | d[i]) * P(d[i]) / P(w[x], w[y])
Taking P(w | d) to be proportional to the TF-IDF value of w in d, as in the single-keyword case, the TF-IDF values of the individual keywords derived from the probabilistic model are thus multiplied, unlike the TF-IDF model, which adds them. This is related to the model's basic assumption that the document must contain all the query keywords: when a document does not contain all of them, the TF-IDF model may still give a nonzero degree of match, while the conditional probability model necessarily gives 0. However, given that a typical query has few keywords (fewer than 3) and that a large number of documents contain all of them, the product relationship of the probabilistic model has a stronger theoretical basis than the additive relationship of the TF-IDF model. Fundamentally, this is because TF-IDF's "degree of match" is an ambiguous concept, while conditional probability has a solid theoretical basis.
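The practical difference between adding and multiplying per-keyword weights shows up as soon as a document lacks one of the keywords. The sketch below uses made-up per-keyword weights for a two-keyword query; the function names and numbers are purely illustrative.

```python
import math

def additive_match(weights):
    """TF-IDF style: per-keyword weights are summed."""
    return sum(weights)

def product_match(weights):
    """Probabilistic style: per-keyword weights are multiplied."""
    return math.prod(weights)

# Hypothetical per-keyword weights for a two-keyword query.
doc_a = [0.30, 0.00]  # strong on the first keyword, second keyword absent
doc_b = [0.10, 0.10]  # contains both keywords, each with a modest weight

additive = {"a": additive_match(doc_a), "b": additive_match(doc_b)}
product = {"a": product_match(doc_a), "b": product_match(doc_b)}
```

Under addition doc_a ranks above doc_b even though it is missing a keyword. Under multiplication doc_a drops to exactly 0 and doc_b ranks first.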
The TF-IDF model is widely used for information retrieval in search engines, but it has always raised many questions. This article proposes a box-ball model based on conditional probability. Its core idea is to turn "the degree of match between query string q and document d" into "the conditional probability of document d given query string q". Viewing the information retrieval problem through probability defines a clearer goal than the degree of match expressed by the TF-IDF model. The probabilistic model shows that this conditional probability mainly involves the following factors: 1) the prior probability P(d) of the document, which corresponds to PageRank; 2) the prior probability P(w) that the word w is used as a search term, which can be obtained statistically; 3) the probability P(w | d) that keyword w represents the topic of document d, or that document d is searched for with the word w, which can be computed by TF-IDF as well as by statistical methods.
Probability interpretation of TF-IDF model