Search Engine Algorithm Research Topic Five: TF-IDF Explained
December 19, 2017 · Search technology
TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique in information retrieval and text mining. It is a statistical method for evaluating how important a word is to a document in a collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but is offset by the frequency with which it appears across the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.
The main idea of TF-IDF is that if a word or phrase appears with high frequency (TF) in one article but rarely in other articles, it is considered to have good category-distinguishing ability and is suitable for classification. TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency. TF measures how often a term t appears in a document d. The main idea of IDF is that the fewer documents contain the term t (that is, the smaller n is), the larger the IDF, and the better the term t distinguishes between categories. Suppose the number of documents in class C that contain term t is m, and the number of documents in all other classes that contain t is k; then the total number of documents containing t is n = m + k. When m is large, n is also large, so the IDF formula yields a small IDF value, suggesting that term t does not discriminate categories well. But in fact, if a term appears frequently in the documents of one class, that term represents the character of that class's text well; it should be given a higher weight and chosen as a feature word of that class to distinguish it from documents of other classes. This is where IDF falls short.
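As a rough numeric illustration of this point, the sketch below (all document counts are invented for illustration, not taken from the article) computes a plain log-based IDF for a few hypothetical terms; note how a term that is frequent within a single class (large m, hence large n) still receives a low IDF, which is exactly the deficiency described above.

```python
import math

def idf(total_docs: int, docs_containing_term: int) -> float:
    """Plain IDF: log of (total documents / documents containing the term)."""
    return math.log(total_docs / docs_containing_term)

total_docs = 10_000

# Term A: appears in 50 documents of one class only (m = 50, k = 0, n = 50).
print(idf(total_docs, 50))     # ~5.30 -> high IDF, looks discriminative

# Term B: appears in 50 documents of that class and 4,950 documents of
# other classes (m = 50, k = 4950, n = 5000).
print(idf(total_docs, 5000))   # ~0.69 -> low IDF, rightly penalized

# Term C: concentrated in one class (m = 4,000, k = 0, n = 4,000).
# IDF is still low even though the term characterizes that class well.
print(idf(total_docs, 4000))   # ~0.92 -> low IDF despite being class-specific
```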
Principle
In a given document, term frequency (TF) is the number of times a given term appears in that document. This count is usually normalized (typically by dividing by the total number of words in the document) to prevent a bias toward long documents, since the same term may occur more often in a long document than in a short one regardless of how important the term actually is.
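A minimal sketch of the normalized count just described (the function name and toy document are my own, not from the article):

```python
from collections import Counter

def term_frequency(term: str, document: list[str]) -> float:
    """Normalized TF: occurrences of `term` divided by total words in the document."""
    counts = Counter(document)
    return counts[term] / len(document)

doc = "the cow jumped over the moon the cow came back".split()
print(term_frequency("cow", doc))  # 2 occurrences / 10 words = 0.2
```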
Inverse document frequency (IDF) is a measure of how much general importance a word carries. The IDF of a particular term can be obtained by dividing the total number of documents by the number of documents containing that term, and then taking the logarithm of the quotient.
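A matching sketch for the quotient-then-logarithm definition above, using a toy corpus of my own (a real implementation would usually add smoothing so a term that appears in no document does not divide by zero):

```python
import math

def inverse_document_frequency(term: str, documents: list[list[str]]) -> float:
    """IDF: log(total documents / documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

corpus = [
    "the cow jumped over the moon".split(),
    "the dog barked at the moon".split(),
    "the cat sat on the mat".split(),
]
print(inverse_document_frequency("cow", corpus))  # log(3/1) ~ 1.10
print(inverse_document_frequency("the", corpus))  # log(3/3) = 0.0
```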
A high term frequency within a particular document, combined with a low document frequency for that term across the whole collection, produces a high TF-IDF weight. As a result, TF-IDF tends to filter out common words and keep the important ones.
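Putting the two pieces together, a minimal sketch of the weighting (again with an invented toy corpus) shows how a word that is common everywhere ends up with weight zero while a rarer, document-specific word keeps a positive weight:

```python
import math
from collections import Counter

def tf_idf(term: str, document: list[str], documents: list[list[str]]) -> float:
    """TF-IDF weight of `term` in `document`, relative to the whole collection."""
    tf = Counter(document)[term] / len(document)
    df = sum(1 for d in documents if term in d)
    return tf * math.log(len(documents) / df)

corpus = [
    "the cow jumped over the moon".split(),
    "the dog barked at the moon".split(),
    "the cat sat on the mat".split(),
]
# "cow" is frequent in document 0 and rare in the corpus -> positive weight;
# "the" appears in every document -> weight 0, i.e. it is filtered out.
print(tf_idf("cow", corpus[0], corpus))  # > 0
print(tf_idf("the", corpus[0], corpus))  # 0.0
```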
Example
There are many different mathematical formulas that can be used to calculate TF-IDF. Term frequency (TF) is the number of occurrences of a term divided by the total number of words in the document. If a document contains 100 words in total and the word "cow" appears 3 times, the term frequency of "cow" in that document is 0.03 (3/100). One way to calculate document frequency (DF) is to count how many documents the word "cow" appears in and divide by the total number of documents in the collection. So if "cow" appears in 1,000 documents and the collection contains 10,000,000 documents, the document frequency is 0.0001 (1,000/10,000,000). Finally, the TF-IDF score can be computed by dividing the term frequency by the document frequency. In this example, the TF-IDF score of "cow" in the collection would be 300 (0.03/0.0001). Another form of the formula takes the logarithm of the inverse document frequency.
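The arithmetic of the example can be checked directly; the last line shows the logarithmic variant mentioned at the end (the numbers are exactly those from the example above):

```python
import math

tf = 3 / 100                  # "cow" appears 3 times in a 100-word document
df = 1_000 / 10_000_000       # "cow" appears in 1,000 of 10,000,000 documents

print(tf / df)                # 300.0  -> the score from the example
print(tf * math.log(1 / df))  # ~0.276 -> the variant using log(1/DF)
```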
Application in the vector space model
The TF-IDF weighting scheme is often used together with cosine similarity in the vector space model to measure the similarity between two documents.
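A minimal sketch of that combination, assuming the documents have already been turned into sparse TF-IDF vectors (the vectors and weights below are invented purely for illustration):

```python
import math

def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse TF-IDF vectors (term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF vectors for two documents.
doc1 = {"cow": 0.9, "moon": 0.3, "jumped": 0.5}
doc2 = {"cow": 0.7, "grass": 0.6, "moon": 0.2}
print(cosine_similarity(doc1, doc2))  # value in [0, 1]; higher means more similar
```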