TF-IDF and its algorithm
Concept
TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to a document in a collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.
Principle
In a given document, term frequency (TF) is the number of times a given term appears in that document. This count is usually normalized (the numerator is generally smaller than the denominator, which distinguishes it from IDF) to prevent a bias toward long documents. (The same term may have a higher raw count in a long document than in a short one, regardless of whether the term is actually important.)
Inverse document frequency (IDF) is a measure of the general importance of a term. The IDF of a particular term can be obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
A high term frequency within a particular document, combined with a low document frequency for the term across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common terms and retain important ones.
The main idea of TF-IDF is this: if a word or phrase appears with a high frequency (TF) in one article but rarely appears in other articles, it is considered to have good category-distinguishing power and to be suitable for classification. TF-IDF is simply TF × IDF, where TF is the term frequency and IDF is the inverse document frequency. TF measures how often a term appears in document d. The main idea behind IDF is that the fewer documents contain a term t (that is, the smaller n, the number of documents containing t), the larger the IDF, and the better t distinguishes between categories. However, suppose the number of documents in class C that contain term t is m, and the number of documents in all other classes that contain t is k; then the total number of documents containing t is n = m + k. When m is large, n is also large, so the IDF formula yields a small value, suggesting that t has weak distinguishing power. Yet if a term appears frequently in the documents of one class, it is actually a good representative of that class's texts and should be given a higher weight and selected as a feature term to distinguish that class from other documents. This is a shortcoming of IDF.
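To make the n = m + k discussion concrete, here is a minimal numeric sketch (the corpus size and document counts are hypothetical, chosen only for illustration):

```python
import math

# Hypothetical counts: a corpus of N = 10,000 documents.
# Term "engine" appears in m = 50 documents of class C and k = 30 documents
# of other classes, so n = m + k = 80 documents contain it overall.
N = 10_000
n_engine = 50 + 30
n_the = 9_800            # a function word present in almost every document

idf_engine = math.log(N / n_engine)   # log(125) ~ 4.8: rare, strong discriminator
idf_the = math.log(N / n_the)         # log(1.02) ~ 0.02: ubiquitous, near-zero weight
print(idf_engine, idf_the)
```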
In a given document, the term frequency (TF) of a term is how often the term appears in the document. This count is normalized by the document's total term count to prevent a bias toward long documents. (The same term may have a higher raw count in a long document than in a short one, regardless of whether it is important.) For a term t_i in a particular document d_j, its importance can be expressed as:

\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

where the numerator n_{i,j} is the number of occurrences of term t_i in document d_j, and the denominator is the sum of the occurrences of all terms in d_j.
Inverse document frequency (IDF) is a measure of the general importance of a term. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient:

\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}

where |D| is the total number of documents in the corpus, and |\{j : t_i \in d_j\}| is the number of documents containing term t_i. If the term does not appear in the corpus, this denominator is zero, so in practice 1 + |\{j : t_i \in d_j\}| is generally used.

And then:

\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i
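A direct transcription of these formulas into code might look like the following sketch; the corpus and tokenization are placeholders, and the + 1 in the IDF denominator is the smoothing just mentioned:

```python
import math
from collections import Counter

def tf(term, doc):
    """tf_{i,j}: occurrences of `term` in `doc` over the total terms in `doc`."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """idf_i: log(|D| / (1 + df)), smoothed to avoid division by zero."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Documents as pre-tokenized term lists (hypothetical data).
corpus = [["a", "cow", "jumps"], ["a", "dog", "barks"], ["a", "cat", "meows"]]
print(tf_idf("cow", corpus[0], corpus))  # ~0.135: rare term, positive weight
print(tf_idf("a", corpus[0], corpus))    # ~-0.096: ubiquitous term (smoothed IDF can go negative)
```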
Examples
Example 1: Many different mathematical formulas can be used to compute TF-IDF; this example uses the formulas above. Term frequency (TF) is the number of occurrences of a term divided by the total number of terms in the document. If a document contains 100 terms in total and the word "cow" appears 3 times, the TF for "cow" in that document is 3/100 = 0.03. One way to compute document frequency (DF) is to count how many documents contain the word "cow" and divide by the total number of documents in the collection. So, if "cow" appears in 1,000 documents out of a total of 10,000,000, the inverse document frequency is log(10,000,000/1,000) = 4. The final TF-IDF score is 0.03 × 4 = 0.12.
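The arithmetic can be checked in a couple of lines (note that this example uses a base-10 logarithm):

```python
import math

tf_cow = 3 / 100                          # "cow" occurs 3 times in a 100-term document
idf_cow = math.log10(10_000_000 / 1_000)  # base-10 log, as in the example: 4.0
print(tf_cow * idf_cow)                   # 0.12
```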
Example 2: The relevance of a document to a query with keywords K1, K2, K3 is tf1·idf1 + tf2·idf2 + tf3·idf3. For example, suppose Document1 contains 1,000 terms in total, and K1, K2, K3 occur in it 100, 200, and 50 times respectively. The numbers of documents containing K1, K2, K3 are 1,000, 10,000, and 5,000, and the document set contains 10,000 documents in total. Then:

TF1 = 100/1000 = 0.1
TF2 = 200/1000 = 0.2
TF3 = 50/1000 = 0.05
IDF1 = log(10000/1000) = log(10) = 2.3
IDF2 = log(10000/10000) = log(1) = 0
IDF3 = log(10000/5000) = log(2) = 0.69

So the relevance of Document1 to the keywords K1, K2, K3 is 0.1 × 2.3 + 0.2 × 0 + 0.05 × 0.69 = 0.2645, where K1 contributes more than K3 and K2 contributes nothing.
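The same computation in code (this example uses natural logarithms, hence log(10) ≈ 2.3):

```python
import math

tf = [100 / 1000, 200 / 1000, 50 / 1000]   # K1, K2, K3 frequencies in Document1
df = [1_000, 10_000, 5_000]                # documents containing each keyword
idf = [math.log(10_000 / d) for d in df]   # natural log: [2.30, 0.0, 0.69]

relevance = sum(t * i for t, i in zip(tf, idf))
print(round(relevance, 4))                 # 0.2649 (0.2645 with the text's rounded logs)
```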
Example 3: In a 1,000-word page, "atomic energy", "of", and "application" appear 2 times, 35 times, and 5 times respectively, so their term frequencies are 0.002, 0.035, and 0.005. Adding these three numbers gives 0.042, a simple measure of the relevance between the page and the query "application of atomic energy". In general, if a query contains the keywords w1, w2, ..., wN and their term frequencies in a particular page are TF1, TF2, ..., TFN, then the relevance of the query to that page is TF1 + TF2 + ... + TFN.
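A quick check of this unweighted sum, using the counts above:

```python
# Term frequencies in a 1,000-word page for the query "application of atomic energy".
tf_atomic_energy = 2 / 1000    # 0.002
tf_of            = 35 / 1000   # 0.035
tf_application   = 5 / 1000    # 0.005

print(round(tf_atomic_energy + tf_of + tf_application, 3))  # 0.042, the unweighted relevance
```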
The reader may have discovered a loophole here. In the example above, the word "of" accounts for more than 80% of the total frequency, yet it is almost useless for determining the topic of the page. Such words are called stop words, meaning that the measurement of relevance should not take their frequency into account. (The original example is Chinese, where the stop word is "的"; Chinese stop words also include "是", "和", "中", "地", "得", and dozens of others.) Ignoring the stop word, the relevance of the page above becomes 0.007, with "atomic energy" contributing 0.002 and "application" contributing 0.005. The attentive reader may also find another, smaller loophole: "application" is a very common word, while "atomic energy" is a very specialized one, and the latter should matter more than the former in a relevance ranking. So every word needs to be given a weight, and the weighting must satisfy the following two conditions:
1. The stronger a word's ability to predict the topic, the greater its weight should be, and vice versa. When we see "atomic energy" in a page, we learn something about the page's topic; seeing "application" once tells us essentially nothing. Therefore the weight of "atomic energy" should be greater than that of "application".
2. The weight of a stop word should be zero.
It is easy to see that if a keyword appears in only a small number of pages, it lets us narrow down the search target easily, so its weight should be large. Conversely, if a word appears in a huge number of pages, seeing it still leaves the target unclear, so its weight should be small. In short, if a keyword w appears in Dw pages, then the larger Dw is, the smaller the weight of w should be, and vice versa. In information retrieval, the most widely used such weight is the inverse document frequency (IDF), whose formula is log(D/Dw), where D is the total number of pages. For example, assume there are D = 1 billion Chinese pages and the stop word "of" appears in all of them, i.e. Dw = 1 billion; then its IDF = log(1 billion / 1 billion) = log(1) = 0. If the specialized term "atomic energy" appears in 2 million pages, i.e. Dw = 2 million, its weight is IDF = log(500) = 6.2. If the common word "application" appears in 500 million pages, its weight is IDF = log(2), only 0.7. In other words, finding one match for "atomic energy" in a page is worth as much as finding nine matches for "application".

Using IDF, the relevance above changes from a simple sum of term frequencies to a weighted sum: TF1·IDF1 + TF2·IDF2 + ... + TFN·IDFN. In the example above, the relevance of the page to "application of atomic energy" becomes 0.0161, of which "atomic energy" contributes 0.0126 and "application" contributes only 0.0035. This ratio matches our intuition quite well.
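Putting the IDF weights and the term frequencies of the example together (D and the per-word page counts are the assumed figures above; natural logarithms throughout):

```python
import math

D = 1_000_000_000                                # assumed total number of pages

idf_of            = math.log(D / 1_000_000_000)  # stop word: log(1) = 0
idf_atomic_energy = math.log(D / 2_000_000)      # log(500) ~ 6.2
idf_application   = math.log(D / 500_000_000)    # log(2)   ~ 0.7

relevance = (0.002 * idf_atomic_energy
             + 0.035 * idf_of
             + 0.005 * idf_application)
print(round(relevance, 4))                       # ~0.0159 (0.0161 with the text's rounding)
```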