In previous articles we covered inverted indexes and saw how to implement one to support retrieval. This article looks at what happens after retrieval: how to sort the results. Ranking as a whole is a complex topic, so here we only discuss text-relevance ranking, and only its simplest form, TF-IDF. Later there may be a chance to talk about the full ranking pipeline of a search engine.
Text Relevance Sorting
First, a few concepts:
Term: the smallest unit after word segmentation. For example, 用Golang写一个搜索引擎 segments into 用 / golang / 写 / 一个 / 搜索引擎, and each of these words is a term.
TF (term frequency): how often a term appears within a document, i.e. the number of occurrences of the term divided by the total number of terms in the document. In the example above, the document has 5 terms and 搜索引擎 appears once, so its TF is 1/5. The higher the TF, the more important the word is within that document.
DF (document frequency): the fraction of all documents in which a term appears. For example, with 100 documents in total, if 搜索引擎 appears in 10 of them, its DF is 10/100 = 0.1.
IDF (inverse document frequency): as the name suggests, the inverse of DF — divide the total number of documents by the number of documents containing the term, then take the logarithm. For 搜索引擎 above, IDF = log(100/10).
Finding the articles that contain a keyword is already solved by the inverted index — as long as segmentation is accurate, retrieval is not a problem. The problem is: having found a pile of articles, how do we sort them so the most relevant ones come first? That is what relevance ranking is for.
TF-IDF Relevance Sorting
We saw the concepts of TF and IDF above. TF's role is obvious: it indicates how important a term is within an article — the higher the TF, the more important the word is in that article. IDF describes how important a term is across the whole collection, i.e. how discriminative it is: the higher the IDF, the rarer the term and the better it distinguishes one document from another.
Why take the log? Personally, I think it would not make a huge difference without it — TF-IDF is just one idea for computing text relevance, not a rigorously proven formula. But from the perspective of information theory, the legendary Shannon defined the amount of information in terms of a logarithm: the rarer the event, the greater its information content. This fits IDF nicely — the higher a term's IDF, the more information it carries.
You can look up "information content" yourself; a simple description is that it measures the probability of an event. If an event happens with probability p, its information content is -log(p) — note the minus sign. For example, suppose the Chinese men's football team plays Brazil, and the probability of China winning is 0.01 (probably overestimated). If Brazil wins, the formula says the information content is almost zero — everyone already knew Brazil would win. But if (I said if) China wins, the information content is huge; it would definitely make the front page. This matches our intuition. IDF uses the same formula with the minus sign moved inside: log(1/p), where p is the DF, the fraction of documents the term appears in.
Combining the two, a term's relevance to a document is simply the product of the two values: TF × IDF.
Why combine the two concepts? TF alone already describes a term's importance within a document, so why add IDF? Mainly to solve two problems.
First, to remove the noise of high-frequency words. Since IDF can be understood as a crude measure of a term's information content, it mainly filters out terms that carry little information. Take the word 的: its TF is very high, but it carries no meaning, and if you compute its IDF it is essentially 0. So with TF × IDF the result is 0, which effectively removes the interference of such common words.
Second, IDF highlights the genuinely distinctive words: if a term has a high IDF, an article containing it is better characterized by that term. This is easy to understand — if a word appears in only a few articles, it is more representative of those articles' content.
Finally, when a query contains multiple terms, a document's relevance is the sum of each term's TF-IDF.
OK, that's all there is to TF-IDF. As for implementation: if you build a full index first, the total number of documents is known, so each term's TF-IDF can generally be computed at indexing time and used directly for sorting at query time. My implementation has no concept of a full index, so I only store each term's TF when a document is added; at retrieval time, the number of documents recalled through the term's posting list determines the IDF, and TF-IDF is computed in real time. With a very large number of documents, real-time computation performs badly, which is why a full index is really necessary — I just haven't built one, but I'll talk later about how full indexes are built.
Word distance
Besides TF-IDF, some other textual factors can also be used in ranking. One is the distance between terms, i.e. the word distance. If the search query is 小米手机, then clearly an article where the two terms (小米, 手机) appear close together is a better match. Compare 小米手机是一款很热门的手机 with 手机应用中有很多关于健康的文章,比如吃小米有什么好处 — of these two documents, the first is obviously more relevant than the second.
Therefore, to preserve word-distance information, we need to store each term's position information in the posting list. At retrieval time these positions are used to compute the distance between the query terms, which is then combined with TF-IDF to express text relevance.
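Given the stored positions, the minimum word distance between two query terms in one document can be found with a single merge pass over their sorted position lists; a sketch under that assumption:

```go
package main

import "fmt"

// minDistance returns the smallest gap between any position in a and any in b.
// Both slices are the sorted positions of two terms within one document,
// as stored in a position-aware inverted index.
func minDistance(a, b []int) int {
	i, j := 0, 0
	best := int(^uint(0) >> 1) // start at max int
	for i < len(a) && j < len(b) {
		d := a[i] - b[j]
		if d < 0 {
			d = -d
		}
		if d < best {
			best = d
		}
		// Advance the pointer at the smaller position to close the gap.
		if a[i] < b[j] {
			i++
		} else {
			j++
		}
	}
	return best
}

func main() {
	// 小米 at position 0, 手机 at position 1: adjacent, distance 1.
	fmt.Println(minDistance([]int{0}, []int{1}))
	// Terms far apart: a much weaker relevance signal.
	fmt.Println(minDistance([]int{3}, []int{20}))
}
```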
Location information
At the same time, in addition to the word distance, there is also a factor affecting the relevance of the ranking, that is, the position of the term, this is also very good understanding, if the 标题
摘要
hit words should obviously be more than in the text hit term weight high, the general situation is to put 标题
, 摘要
The hit TD-IDF multiplied by a factor to enlarge the effect, thus affecting the final correlation calculation result.
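A minimal sketch of such field boosting — the specific weight values here are illustrative assumptions, not numbers from the article:

```go
package main

import "fmt"

// fieldBoost holds per-field multipliers; the values are illustrative.
var fieldBoost = map[string]float64{
	"title":    3.0, // 标题 hits count the most
	"abstract": 2.0, // 摘要 hits count more than body hits
	"body":     1.0,
}

// boostedScore scales a term's TF-IDF by the field in which it was hit.
func boostedScore(tfidf float64, field string) float64 {
	return tfidf * fieldBoost[field]
}

func main() {
	fmt.Println(boostedScore(0.5, "title")) // 1.5
	fmt.Println(boostedScore(0.5, "body"))  // 0.5
}
```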
Other models
Besides using TF-IDF directly, there are now many other text-relevance ranking models, such as the probability-based BM25, which I won't expand on here. If there is interest, after finishing this series I may write a few articles specifically about ranking: text ranking, document importance, how to compute document importance offline with machine learning, and so on — including how to combine all these signals (text relevance, word distance, position, importance, sales, clicks) into a single score.
The next article will cover some inverted-index storage topics I did not implement, such as index compression, then how the inverted index is built and, when documents are added incrementally, how indexes are merged.
Finally, you are welcome to subscribe to my WeChat public account, where this article was first published :)