TF-IDF algorithm (1): Overview of the algorithm

Source: Internet
Author: User
Tags idf

Suppose you have a very long article and want to extract its keywords with no human intervention at all. How would you do it? A closely related question is how to judge the similarity of two articles. Both problems come up frequently in data mining and information retrieval, and the TF-IDF algorithm can address both. I need to use this algorithm at the moment, so I spent the last two days studying it first.

TF-IDF Overview

When meeting a new algorithm, the first step is of course to understand what it is. Here we start by quoting the explanation from Baidu Encyclopedia: TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique in information retrieval and data mining. It assesses how important a term is to one document in a document set or corpus. The importance of a term increases in proportion to the number of times it appears in the document, but decreases as the term appears more frequently across the corpus. The main idea is that if a word or phrase has a high term frequency (TF) in one article and rarely appears in other articles, it is considered to discriminate well between categories and to be suitable for classification. Such a word can also serve as the keyword mentioned above.

So we now have a rough understanding of the algorithm: at the very least, it uses weighting to determine how important a word is to an article. How is it actually implemented? Let us study it step by step:

Word frequency (TF) and inverse document frequency (IDF)

First of all, given the name of the algorithm, you will naturally be curious about what TF and IDF are. Go back to the question raised earlier: finding the keywords of a very long article. Intuitively, if a word is critical to the article, it should occur many times, so we count the "word frequency" (term frequency). This word frequency is TF.

You might object that words such as "the" or "is" will surely have the most occurrences. These are called stop words. They are completely useless for finding keywords, so we have to filter them out.
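Filtering out such very common words (usually called stop words) can be sketched as follows. This is only a minimal illustration: the tiny `STOP_WORDS` set and the function name `remove_stop_words` are assumptions made for the example, and a real stop-word list would be much longer and language-specific.

```python
# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}

def remove_stop_words(tokens):
    """Return the tokens that are not stop words (case-insensitive check)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the clustering algorithm is a method of grouping".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['clustering', 'algorithm', 'method', 'grouping']
```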

Suppose we have filtered out all of those words. We now run into another problem. Say we are looking for keywords in an article about clustering, and we find that "clustering" and "algorithm" appear the same number of times. Are they equally important? Of course not. "Algorithm" is a more common word than "clustering", so given the same number of occurrences we have reason to think that "clustering" is the more important of the two. Put another way, if a word is relatively rare in general but appears many times in this article, it probably reflects what this article is about, and it is exactly the kind of keyword we are looking for.

Borrowing the weighting idea found in methods such as the analytic hierarchy process, we can give each word a weight: the most common words get a very small weight, and rarer words get a larger one. This weight is called the "inverse document frequency" (IDF), and it becomes smaller as a word becomes more common. The TF-IDF value is the product of the word frequency TF and the inverse document frequency IDF. The higher the value, the more important the term is to the article.

Steps

(1) Calculating the word frequency

Word frequency = number of occurrences of a term in an article

To remove the effect of article length and make different articles comparable, we normalize the word frequency:

Word frequency = number of occurrences of a term in an article / total number of words in the article

or: Word frequency = number of occurrences of a term in an article / number of occurrences of the most frequent word in the article
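The two normalizations above can be sketched in a few lines. The function name `term_frequency` and its `normalize` parameter are assumptions made for this illustration, and the input is assumed to be a list of already tokenized words.

```python
from collections import Counter

def term_frequency(tokens, normalize="length"):
    """Normalized word frequency for every distinct token.

    normalize="length": divide by the total number of words in the article.
    normalize="max":    divide by the count of the most frequent word.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    if normalize == "length":
        denom = len(tokens)
    elif normalize == "max":
        denom = max(counts.values())
    else:
        raise ValueError("normalize must be 'length' or 'max'")
    return {term: n / denom for term, n in counts.items()}

tokens = ["clustering", "algorithm", "clustering", "data"]
tf_len = term_frequency(tokens)          # clustering: 0.5, algorithm: 0.25, data: 0.25
tf_max = term_frequency(tokens, "max")   # clustering: 1.0, algorithm: 0.5,  data: 0.5
print(tf_len)
print(tf_max)
```

Both variants rank the words of a single article identically; they differ only in the scale of the resulting values.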

(2) Calculating inverse document frequency

First, a corpus is needed to model the environment in which the language is used.

Inverse document frequency (IDF) = log (total number of documents in the corpus / (number of documents containing the word + 1))

The 1 is added to the denominator so that it is never 0, which would otherwise happen when no document in the corpus contains the word.
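A minimal sketch of this formula follows. The text does not say which logarithm base is meant, so the natural logarithm is used here as an assumption. The function name `inverse_document_frequency` and the toy corpus are likewise made up for the example.

```python
import math

def inverse_document_frequency(term, documents):
    """IDF = log(number of documents / (documents containing term + 1)).

    `documents` is a list of token lists. Natural log is an assumption.
    """
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / (containing + 1))

# Hypothetical toy corpus of four tokenized documents.
corpus = [
    ["clustering", "algorithm"],
    ["sorting", "algorithm"],
    ["graph", "algorithm"],
    ["database", "index"],
]

idf_rare = inverse_document_frequency("clustering", corpus)   # log(4 / 2)
idf_common = inverse_document_frequency("algorithm", corpus)  # log(4 / 4) = 0.0
idf_unseen = inverse_document_frequency("quantum", corpus)    # log(4 / 1), no division by zero
print(idf_rare, idf_common, idf_unseen)
```

One consequence of adding 1 to the denominator is that a word contained in every document gets a slightly negative IDF, since the ratio drops below 1.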

(3) Calculate TF-IDF value

From the analysis above: TF-IDF value = TF * IDF.

As expected, the TF-IDF value grows with how often the word occurs in the article and shrinks as the word appears in more documents of the corpus, in line with the earlier analysis.
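To make the relationship concrete, here is a small worked calculation. All numbers are hypothetical: an article of 1,000 words, a corpus of 10,000 documents, and two words that each occur 20 times in the article but differ in how many documents contain them. The function name `tf_idf` and the natural logarithm are assumptions.

```python
import math

def tf_idf(term_count, total_terms, total_docs, docs_with_term):
    """TF-IDF = (term_count / total_terms) * log(total_docs / (docs_with_term + 1))."""
    tf = term_count / total_terms
    idf = math.log(total_docs / (docs_with_term + 1))
    return tf * idf

# Same word frequency (20 / 1000), different rarity in the corpus.
rare = tf_idf(20, 1000, 10_000, 99)       # 0.02 * ln(100)
common = tf_idf(20, 1000, 10_000, 4999)   # 0.02 * ln(2)
print(round(rare, 4), round(common, 4))   # 0.0921 0.0139
```

With equal word frequency, the rarer word ends up with the higher score, which matches the "clustering" versus "algorithm" reasoning earlier in the article.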

(4) Find the keywords

After calculating the TF-IDF value of every word in the article, sort the words by value in descending order and take the top few as the keywords.
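The ranking step might look like the sketch below. It assumes the length-normalized word frequency and the natural-log IDF with 1 added to the denominator. The function name `top_keywords` and the toy corpus are invented for the example.

```python
import math
from collections import Counter

def top_keywords(doc_tokens, corpus, k=3):
    """Return the k terms of doc_tokens with the highest TF-IDF score."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, n in counts.items():
        doc_freq = sum(1 for doc in corpus if term in doc)
        scores[term] = (n / total) * math.log(n_docs / (doc_freq + 1))
    # Sort terms by score, highest first, and keep the first k.
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical corpus; the first document is the article being analysed.
article = ["clustering", "algorithm", "clustering", "data", "algorithm"]
corpus = [
    article,
    ["sorting", "algorithm", "data"],
    ["graph", "algorithm"],
    ["database", "index", "data"],
    ["network", "protocol"],
]

keywords = top_keywords(article, corpus, k=2)
print(keywords)  # ['clustering', 'algorithm']
```

In this toy data "clustering" and "algorithm" both occur twice in the article, but "clustering" ranks first because fewer documents in the corpus contain it.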

(5) Calculate the similarity of the article

Compute the keywords of each article, take the same number of keywords from each, and merge them into one set. For each article, compute the word frequency of every word in the set, which gives one frequency vector per article. Then compare the two vectors using Euclidean distance or cosine similarity: a smaller Euclidean distance, or a larger cosine similarity, means the two articles are more alike.
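A minimal sketch of the cosine-similarity variant follows. It assumes the merged keyword set is already available and uses raw counts as the frequency vectors. Cosine similarity does not change when a vector is scaled, so dividing the counts by article length would give the same result. The function name `cosine_similarity` and the sample data are assumptions.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b, keywords):
    """Cosine similarity of two articles over a shared list of keywords."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vec_a = [counts_a[k] for k in keywords]
    vec_b = [counts_b[k] for k in keywords]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    # Treat an all-zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0

# Hypothetical merged keyword set and tokenized articles.
keywords = ["clustering", "algorithm", "data", "index"]
article_a = ["clustering", "algorithm", "clustering", "data"]   # vector [2, 1, 1, 0]
article_b = ["clustering", "algorithm", "data", "data"]         # vector [1, 1, 2, 0]
article_c = ["index", "index"]                                  # vector [0, 0, 0, 2]

sim_ab = cosine_similarity(article_a, article_b, keywords)  # 5 / 6, about 0.833
sim_ac = cosine_similarity(article_a, article_c, keywords)  # 0.0, no shared keywords
print(sim_ab, sim_ac)
```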

Advantages and Disadvantages

1. Pros: the algorithm is easy to understand and easy to implement.

2. Cons: the simple structure of IDF does not effectively reflect how important a word is or how its occurrences are distributed, so it does not adjust the weights very well, and the accuracy of the algorithm is limited as a result. In addition, the algorithm ignores position information: words are treated the same wherever they appear in the article, although words in certain positions, such as the end of the article, tend to be relatively more important. One possible improvement is to assign different weights to words depending on their position in the article.
