TF-IDF and its algorithm
TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to a document in a collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus. Variants of TF-IDF weighting are often used by search engines as a measure of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.
In a given document, term frequency (TF) is the number of times a given term appears in that document. This count is usually normalized (the numerator is generally smaller than the denominator, which distinguishes it from IDF) to prevent a bias toward long documents. (The same term is likely to have a higher raw count in a long document than in a short one, regardless of whether the term is actually important.)
Inverse document frequency (IDF) is a measure of a term's general importance. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
A high term frequency within a particular document, combined with a low document frequency for that term across the whole collection, produces a high TF-IDF weight. As a result, TF-IDF tends to filter out common terms and retain important ones.
The main idea of TF-IDF is this: if a word or phrase appears with high frequency (TF) in one article but rarely in other articles, it is considered to have good discriminating power and is suitable for classification. TF-IDF is simply TF * IDF, where TF is the term frequency (how often the term appears in document D) and IDF is the inverse document frequency. The idea behind IDF is: the fewer documents contain term T (i.e. the smaller n is), the larger the IDF, and the better T distinguishes between classes. Suppose the number of documents in class C containing term T is m, and the number of documents in other classes containing T is k; then the total number of documents containing T is n = m + k. When m is large, n is also large, so the IDF computed by the formula is small, suggesting that T has weak discriminating power. But in fact, if a term appears frequently in the documents of one class, that indicates the term represents the content of that class well; it should be given a higher weight and selected as a feature term to distinguish that class from other classes of documents. This is a shortcoming of IDF.
In a given document, term frequency (TF) is how often a given term appears in that document. The count is normalized by the total number of terms (term count) to prevent a bias toward long documents. (The same term is likely to appear more times in a long document than in a short one, regardless of its importance.) For a term t_i in document d_j, its importance can be expressed as:

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

where n_{i,j} is the number of occurrences of term t_i in document d_j, and the denominator is the sum of the occurrences of all terms in d_j.
Inverse document frequency (IDF) is a measure of a term's general importance. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, then taking the logarithm of the quotient:

idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}

- |D|: the total number of documents in the corpus
- |\{j : t_i \in d_j\}|: the number of documents containing term t_i. If the term is not in the corpus this divisor would be zero, so in practice 1 + |\{j : t_i \in d_j\}| is generally used.
Then tfidf_{i,j} = tf_{i,j} \times idf_i. A high term frequency within a particular document, combined with a low document frequency for that term across the whole collection, produces a high TF-IDF weight. As a result, TF-IDF tends to filter out common terms and retain important ones.
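The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not any library's API: the whitespace tokenizer is an assumption, and the +1 in the IDF divisor is the smoothing mentioned above.

```python
import math

def tf(term, doc):
    # term frequency: occurrences of the term divided by the
    # total number of terms in the document
    words = doc.lower().split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # inverse document frequency; the +1 in the divisor is the smoothing
    # noted above, so a term absent from the corpus never divides by zero
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / (1 + containing))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = ["the cow says moo", "the sparrow says tweet", "the cat says meow"]
print(tf("cow", corpus[0]))   # 0.25: one occurrence out of four terms
print(tf_idf("cow", corpus[0], corpus))
```

A real tokenizer would also strip punctuation and handle stop words; the point here is only the shape of the two factors.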
Example 1: Many different mathematical formulas can be used to compute TF-IDF; this example uses the formulas above. Term frequency (TF) is the number of occurrences of a term divided by the total number of terms in the document. If a document contains 100 words in total and the word "cow" appears 3 times, then the TF of "cow" in that document is 3/100 = 0.03. One way to compute the inverse document frequency (IDF) is to count how many documents the word "cow" appears in and divide the total number of documents by that count. So if "cow" appears in 1,000 documents and the collection contains 10,000,000 documents, the inverse document frequency is log(10,000,000 / 1,000) = 4. The final TF-IDF score is 0.03 * 4 = 0.12.
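The arithmetic of the "cow" example checks out in a quick script. Note that the worked numbers (log(10,000) = 4) imply a base-10 logarithm:

```python
import math

tf = 3 / 100                          # "cow" appears 3 times in 100 words
idf = math.log10(10_000_000 / 1_000)  # 1,000 of 10,000,000 documents contain "cow"
score = tf * idf
print(tf, idf, score)                 # 0.03 4.0 0.12
```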
Example 2: The relevance of a document to a query with keywords K1, K2, K3 is tf1*idf1 + tf2*idf2 + tf3*idf3. Suppose document1 contains 1,000 terms in total, and K1, K2, K3 occur in it 100, 200, and 50 times respectively. The numbers of documents containing K1, K2, K3 are 1,000, 10,000, and 5,000, and the document set contains 10,000 documents in total. Then:

TF1 = 100/1000 = 0.1
TF2 = 200/1000 = 0.2
TF3 = 50/1000 = 0.05
IDF1 = log(10000/1000) = log(10) = 2.3
IDF2 = log(10000/10000) = log(1) = 0
IDF3 = log(10000/5000) = log(2) = 0.69

So the relevance of document1 to K1, K2, K3 is 0.1*2.3 + 0.2*0 + 0.05*0.69 = 0.2645, where K1 contributes a larger share than K3, and K2 contributes nothing.
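The same numbers can be checked programmatically. This sketch uses the natural logarithm, which is what the example's log(10) ≈ 2.3 implies:

```python
import math

total_terms = 1000                         # terms in document1
counts = {"K1": 100, "K2": 200, "K3": 50}  # occurrences in document1
doc_freq = {"K1": 1000, "K2": 10000, "K3": 5000}
n_docs = 10000                             # documents in the whole set

relevance = sum((counts[k] / total_terms) * math.log(n_docs / doc_freq[k])
                for k in counts)
print(round(relevance, 4))  # 0.2649 (0.2645 with the hand-rounded logs above)
```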
Example 3: In a 1,000-word page, the words "atomic energy", "of", and "application" appear 2 times, 35 times, and 5 times respectively, so their term frequencies are 0.002, 0.035, and 0.005. Adding these three numbers gives 0.042, a simple measure of the relevance of the page to the query "application of atomic energy". In general, if a query contains the keywords w1, w2, ..., wN and their term frequencies in a particular page are TF1, TF2, ..., TFN, then the relevance of the query to the page is TF1 + TF2 + ... + TFN.
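The simple term-frequency sum from this example, with the word counts hard-coded from the text:

```python
total_words = 1000
counts = {"atomic energy": 2, "of": 35, "application": 5}

relevance = sum(c / total_words for c in counts.values())
print(round(relevance, 3))  # 0.042 (= 0.002 + 0.035 + 0.005)
```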
The reader may have spotted a loophole. In the example above, the word "of" accounts for more than 80% of the total frequency, yet it is almost useless for determining the topic of the page. Such words are called stop words, meaning their frequency should not be counted when measuring relevance. In Chinese, dozens of words such as "is", "and", "in", and "of" should be treated as stop words. Ignoring them, the relevance of the page above becomes 0.007, with "atomic energy" contributing 0.002 and "application" contributing 0.005. An attentive reader may also notice another small loophole. In Chinese, "application" is a very common word, while "atomic energy" is a very specialized one; the latter should count for more than the former in relevance ranking. So we need to give each word a weight, and this weighting must satisfy the following two conditions:
1. The better a word predicts the topic, the greater its weight should be, and vice versa. Seeing the word "atomic energy" on a page tells us something about its topic; seeing "application" once tells us almost nothing. The weight of "atomic energy" should therefore be greater than that of "application".
2. Stop words should have a weight of zero.
It is easy to see that if a keyword appears in only a small number of pages, it helps us narrow down the search target, so its weight should be large. Conversely, if a word appears in a huge number of pages, seeing it tells us little about what to look for, so its weight should be small. In short, if a keyword w appears in Dw pages, then the larger Dw is, the smaller w's weight should be, and vice versa. In information retrieval, the most widely used such weight is the inverse document frequency (IDF), whose formula is log(D/Dw), where D is the total number of pages. For example, suppose the number of Chinese pages is D = 1 billion, and the stop word "of" appears in all of them, i.e. Dw = 1 billion; then its IDF = log(1 billion / 1 billion) = log(1) = 0. If the specialized word "atomic energy" appears in 2 million pages, i.e. Dw = 2 million, then its weight is IDF = log(500) = 6.2. And suppose the common word "application" appears in 500 million pages; its weight IDF = log(2) is only 0.7. This says that one match of "atomic energy" in a page is worth roughly nine matches of "application". With IDF, the relevance above changes from a simple sum of term frequencies into a weighted sum: TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN. In the example above, the relevance of the page to "application of atomic energy" is 0.0161, of which "atomic energy" contributes 0.0126 and "application" contributes only 0.0035. This ratio agrees well with our intuition.
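The weighted version of the example can be reproduced as follows. The page counts come from the text; with full-precision natural logs the total comes out near the article's rounded figure of 0.0161:

```python
import math

D = 1_000_000_000                 # total number of Chinese pages, per the example
pages = {"of": 1_000_000_000,     # the stop word appears in every page
         "atomic energy": 2_000_000,
         "application": 500_000_000}
tf = {"of": 0.035, "atomic energy": 0.002, "application": 0.005}

idf = {w: math.log(D / dw) for w, dw in pages.items()}
relevance = sum(tf[w] * idf[w] for w in tf)
print(round(idf["atomic energy"], 1))  # 6.2
print(round(relevance, 4))             # 0.0159, vs. 0.0161 in the article's rounding
```

The stop word "of" contributes nothing because its IDF is exactly zero, which is precisely condition 2 above.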
Original address: http://www.cnblogs.com/biyeymyhjob/archive/2012/07/17/2595249.html