Original link: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
The title looks complicated, but what I am going to discuss is a very simple question.
Suppose there is a very long article, and I want a computer to extract its keywords (automatic keyphrase extraction) with no human intervention at all. How can this be done well?
This problem touches on data mining, text processing, information retrieval, and several other frontiers of computer science. Surprisingly, there is a very simple classical algorithm that gives quite satisfactory results. It is so simple that it needs no advanced mathematics; an ordinary person can understand it in ten minutes. That algorithm, the subject of today's post, is TF-IDF.
Let's start with an example. Suppose we have a long article, "Bee Farming in China", and we want a computer to extract its keywords.
An obvious first idea is to find the words that occur most frequently: if a word is important, it should appear many times in the article. So we compute "term frequency" (TF) statistics.
The result, as you have probably guessed, is that the most frequent words are "的", "是", and "在" (roughly "of", "is", and "in"). These are called "stop words": words that do not help identify the content and must be filtered out.
Suppose we filter them all out and consider only the remaining, meaningful words. This raises another problem: we may find that the three words "China", "bee", and "farming" appear the same number of times. Does that mean that, as keywords, they are equally important?
Obviously not. "China" is a very common word, while "bee" and "farming" are relatively rare. If the three words appear the same number of times in an article, there is reason to believe that "bee" and "farming" are more important than "China"; that is, in the keyword ranking, "bee" and "farming" should come before "China".
So we need an importance adjustment factor that measures how common a word is. If a word is rare but appears many times in this article, it very likely reflects the character of the article and is exactly the keyword we need.
Expressed in statistical terms: on top of the term frequency, we assign each word an "importance" weight. The most common words ("的", "是", "在") get the smallest weight, fairly common words ("China") get a small weight, and rarer words ("bee", "farming") get a larger weight. This weight is called the "inverse document frequency" (IDF), and its size is inversely related to how common the word is.
Once you know the term frequency (TF) and the inverse document frequency (IDF), multiply the two to get the word's TF-IDF value. The more important a word is to an article, the larger its TF-IDF value. The top few words by TF-IDF are therefore the article's keywords.
Here are the details of the algorithm.
The first step is to calculate the term frequency:

    TF = number of times the term appears in the document

Since articles vary in length, the term frequency is normalized so that different articles can be compared:

    TF = (number of times the term appears in the document) / (total number of terms in the document)

Or:

    TF = (number of times the term appears in the document) / (occurrences of the document's most frequent term)
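To make the normalization concrete, here is a minimal Python sketch of the two term-frequency variants (the whitespace tokenization and sample sentence are assumptions for illustration):

```python
from collections import Counter

def tf_by_total(words):
    """Term frequency normalized by document length."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def tf_by_max(words):
    """Term frequency normalized by the most frequent term's count."""
    counts = Counter(words)
    max_count = max(counts.values())
    return {w: c / max_count for w, c in counts.items()}

words = "bee farming in china uses bee colonies".split()
print(tf_by_total(words)["bee"])  # 2/7, about 0.286
print(tf_by_max(words)["bee"])    # 2/2 = 1.0
```

Both variants preserve the ranking of words within one document; the choice only matters when scores are compared across documents of different lengths.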
The second step is to calculate the inverse document frequency:

    IDF = log( total number of documents in the corpus / (number of documents containing the term + 1) )

This requires a corpus to model the language's usage environment. The more common a word is, the larger the denominator and the closer its inverse document frequency gets to 0. The 1 is added to the denominator to avoid dividing by zero (the case where no document contains the word); log means taking the logarithm of the result.
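A minimal sketch of the IDF computation over a toy corpus (the three-document corpus is an invented example, and a base-10 logarithm is assumed):

```python
import math

def idf(term, corpus):
    """log(N / (df + 1)): N documents in the corpus, df of them
    containing the term. The +1 avoids a zero denominator."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(n_docs / (df + 1))

corpus = [
    {"china", "bee", "farming"},
    {"china", "economy"},
    {"china", "history"},
]
print(idf("china", corpus))  # log10(3/4) < 0: appears everywhere
print(idf("bee", corpus))    # log10(3/2) > 0: rarer, weighted higher
```

Note that with this +1 smoothing, a word appearing in every document gets a slightly negative IDF; some implementations instead use log(N / df) or other smoothing variants.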
The third step is to calculate TF-IDF:

    TF-IDF = TF × IDF
As you can see, TF-IDF is proportional to how often a word occurs in the document and inversely related to how often it occurs across the language as a whole. The algorithm for automatic keyword extraction is therefore straightforward: compute the TF-IDF value of every word in the document, sort in descending order, and take the top few words.
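The whole pipeline can be sketched in a few lines of Python (the tokenized toy corpus and document are invented for illustration):

```python
import math
from collections import Counter

def extract_keywords(document, corpus, top_n=3):
    """Score each word of `document` (a list of tokens) by TF-IDF
    against `corpus` (a list of tokenized documents) and return
    the top_n highest-scoring words."""
    counts = Counter(document)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / len(document)                    # term frequency
        df = sum(1 for doc in corpus if word in doc)  # document frequency
        scores[word] = tf * math.log10(n_docs / (df + 1))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

corpus = [
    ["china", "news"], ["china", "economy"],
    ["bee", "farming", "china"], ["china", "travel"],
]
doc = ["china", "bee", "bee", "farming"]
print(extract_keywords(doc, corpus, top_n=2))  # ['bee', 'farming']
```

Because "china" occurs in every corpus document, its IDF (and hence its score) is low, so the rarer words win, just as in the bee-farming example.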
Returning to "Bee Farming in China": suppose the article is 1000 words long and "China", "bee", and "farming" each appear 20 times, so the term frequency (TF) of all three words is 0.02. Next, suppose a Google search finds 25 billion pages containing the word "的", and take that as the total number of Chinese pages. Suppose further that 6.23 billion pages contain "China", 48.4 million contain "bee", and 97.3 million contain "farming". Their inverse document frequencies (IDF, base-10 logarithm) and TF-IDF values are then:

    Word      Pages containing it    IDF      TF-IDF
    China     6.23 billion           0.603    0.0121
    bee       48.4 million           2.713    0.0543
    farming   97.3 million           2.410    0.0482
As can be seen from the table above, "bee" has the highest TF-IDF value, "farming" comes second, and "China" is lowest. (If you also calculated the TF-IDF of "的", it would be extremely close to 0.) So, if only one keyword may be chosen, "bee" is the keyword of this article.
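The numbers in this example can be reproduced directly; a base-10 logarithm is assumed, and the +1 in the denominator is negligible at this scale, so it is omitted:

```python
import math

TOTAL_PAGES = 25e9   # assumed total number of Chinese pages
TF = 20 / 1000       # each word appears 20 times in a 1000-word article

pages_containing = {"China": 6.23e9, "bee": 48.4e6, "farming": 97.3e6}

results = {}
for word, df in pages_containing.items():
    idf = math.log10(TOTAL_PAGES / df)
    results[word] = (round(idf, 3), round(TF * idf, 4))

print(results["bee"])      # (2.713, 0.0543)
print(results["farming"])  # (2.41, 0.0482)
print(results["China"])    # (0.603, 0.0121)
```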
Beyond automatic keyword extraction, the TF-IDF algorithm is useful in many other places. For example, in information retrieval you can compute, for each document, the TF-IDF of each search term ("China", "bee", "farming") and add them up to get the document's overall TF-IDF score for the query. The document with the highest score is the one most relevant to the search terms.
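A sketch of that retrieval use, with an invented five-document corpus (each document tokenized into a word list):

```python
import math
from collections import Counter

def query_score(query_terms, document, corpus):
    """Relevance of one document to a query: sum of per-term TF-IDF."""
    counts = Counter(document)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = counts[term] / len(document)
        df = sum(1 for doc in corpus if term in doc)
        score += tf * math.log10(n_docs / (df + 1))
    return score

corpus = [
    ["china", "bee", "farming", "bee"],
    ["china", "economy", "growth", "policy"],
    ["bee", "honey", "recipes", "cooking"],
    ["china", "travel", "guide", "tips"],
    ["world", "farming", "news", "report"],
]
query = ["china", "bee", "farming"]
best = max(corpus, key=lambda doc: query_score(query, doc, corpus))
print(best)  # ['china', 'bee', 'farming', 'bee']
```

The first document matches all three query terms (and "bee" twice), so it gets the highest summed score.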
The advantages of the TF-IDF algorithm are that it is simple, fast, and its results match intuition fairly well. Its disadvantage is that measuring a word's importance purely by "term frequency" is not comprehensive enough: sometimes an important word does not appear many times. Moreover, the algorithm cannot reflect a word's position in the text; words that appear early and words that appear late are treated as equally important, which is not true. (One possible fix is to give greater weight to the first paragraph of the text and to the first sentence of each paragraph.)
Next time, I will combine TF-IDF with cosine similarity to measure the similarity between documents.
Title: The application of TF-IDF and cosine similarity (Part 1): automatic keyword extraction