Applications of TF-IDF and Cosine Similarity (I): Automatic Keyword Extraction

Reprinted from http://www.ruanyifeng.com/blog/

The title looks complicated, but the question I want to discuss is actually very simple.

Suppose there is a long article, and I want a computer to extract its keywords (automatic keyphrase extraction) with no manual intervention at all. How can this be done correctly?

This problem touches on many cutting-edge areas of computer science, such as data mining, text processing, and information retrieval. Surprisingly, though, a very simple classical algorithm gives quite satisfactory results. It is so simple that it needs no advanced mathematics, and an ordinary person can understand it in about 10 minutes. This is the TF-IDF algorithm, which I want to introduce today.

Let's start with an example. Suppose there is a long article titled "Bee Farming in China", and we want a computer to extract its keywords.

An obvious idea is to find the words that appear most often. If a word is important, it should appear many times in the article. So we start by counting term frequency (TF).

As you have probably guessed, the most frequent words turn out to be "的" ("of"), "是" ("is"), and "在" ("in"). These are called "stop words": words that do not help in finding results and must be filtered out.
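The counting and filtering described so far can be sketched in a few lines. This is only an illustration: it assumes English text, a crude regular-expression tokenizer, and a tiny hand-picked stop-word list (`STOP_WORDS` and `term_counts` are made-up names). Chinese text would first need a word segmenter.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "in", "of", "a", "and"}

def term_counts(text, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears in text."""
    # Lowercase the text and treat every run of letters as one word.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

sample = "The bee is in the hive. The bee farming in China is growing."
counts = term_counts(sample)
print(counts.most_common(3))
```

Without the filter, "the" would top the list here even though it says nothing about the topic.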

Suppose we filter them out and consider only the remaining meaningful words. We then run into another problem: we may find that "China", "bee", and "farming" all appear the same number of times. Does this mean that, as keywords, they are equally important?

Obviously not. "China" is a very common word, while "bee" and "farming" are less common. If these three words appear the same number of times in an article, it is reasonable to conclude that "bee" and "farming" are more important than "China"; that is, in the keyword ranking, "bee" and "farming" should come before "China".

So we need an importance adjustment factor that measures whether a word is common. If a word is rare in general but appears many times in this article, it probably reflects what the article is about, and it is exactly the kind of keyword we are looking for.

In statistical terms, on top of term frequency, each word is assigned an "importance" weight. The most common words ("of", "is", "in") get the smallest weight, fairly common words ("China") get a small weight, and rarer words ("bee", "farming") get a larger weight. This weight is called the "inverse document frequency" (IDF), and its size is inversely related to how common the word is.

Once we know the term frequency (TF) and the inverse document frequency (IDF), multiplying the two gives the word's TF-IDF value. The more important a word is to an article, the larger its TF-IDF value. The top-ranked words are therefore the article's keywords.

The details of the algorithm are as follows.

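Step 1: Calculate the term frequency.

    Term frequency (TF) = number of times the word appears in the article

Because articles differ in length, the term frequency should be normalized so that different articles can be compared.

    Term frequency (TF) = number of times the word appears in the article / total number of words in the article

Or

    Term frequency (TF) = number of times the word appears in the article / number of times the most frequent word appears in the article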
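As a rough sketch, term frequency can be normalized either by the length of the article or by the count of the article's most frequent word. The function names `tf_by_length` and `tf_by_max` are illustrative, and the input is assumed to be a list of already-tokenized words with stop words removed.

```python
from collections import Counter

def tf_by_length(words):
    """TF = occurrences of a word / total number of words in the article."""
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def tf_by_max(words):
    """TF = occurrences of a word / occurrences of the most frequent word."""
    if not words:
        return {}
    counts = Counter(words)
    top = max(counts.values())
    return {w: c / top for w, c in counts.items()}

words = ["bee", "farming", "bee", "china", "bee", "farming"]
print(tf_by_length(words)["bee"])    # 3 occurrences out of 6 words
print(tf_by_max(words)["farming"])   # 2 occurrences vs. a top count of 3
```

Both variants preserve the ranking of words within one article; they differ only in the scale used when comparing articles.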

Step 2: Calculate the inverse document frequency.

This requires a corpus to model the environment in which the language is used.

    Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1))

The more common a word is, the larger the denominator, and the smaller the inverse document frequency, which approaches 0. The 1 is added to the denominator to keep it from being 0 (which would happen if no document contained the word). log means taking the logarithm of the resulting value.
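A minimal sketch of this step, assuming each document in the corpus is represented as a set of words. The text only says "log", so the base-10 logarithm used here is an assumption, and the eight-document corpus is made up for illustration.

```python
import math

def idf(word, corpus):
    """IDF = log10(total documents / (documents containing the word + 1)).

    corpus is a non-empty list of documents, each given as a set of words.
    """
    containing = sum(1 for doc in corpus if word in doc)
    return math.log10(len(corpus) / (containing + 1))

# Toy corpus of eight documents (made up for illustration).
corpus = [
    {"china", "bee", "farming"},
    {"china", "economy"},
    {"china", "history"},
    {"china", "travel"},
    {"weather", "today"},
    {"music", "film"},
    {"sports", "news"},
    {"food", "hotel"},
]

print(idf("china", corpus))  # common word: 8 / (4 + 1), small IDF
print(idf("bee", corpus))    # rare word: 8 / (1 + 1), larger IDF
```

One side effect of the added 1: in a very small corpus, a word that appears in every document gets a slightly negative IDF. With a realistically large corpus the effect is negligible.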

Step 3: Calculate the TF-IDF.

    TF-IDF = term frequency (TF) × inverse document frequency (IDF)

As you can see, TF-IDF is proportional to the number of times a word appears in the document, and it falls as the word becomes more common across the language as a whole. The algorithm for automatic keyword extraction is therefore clear: compute the TF-IDF value of every word in the document, sort in descending order, and take the top few words.
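Put together, the three steps might look like the sketch below. It assumes length-normalized TF, a base-10 logarithm, and a non-empty corpus given as a list of word sets; `extract_keywords` and the toy data are illustrative, not part of the original article.

```python
import math
from collections import Counter

def extract_keywords(doc_words, corpus, top_n=3):
    """Return the top_n (word, tf_idf) pairs for one document."""
    counts = Counter(doc_words)
    total = len(doc_words)
    scores = {}
    for word, count in counts.items():
        tf = count / total                                   # step 1
        containing = sum(1 for other in corpus if word in other)
        idf = math.log10(len(corpus) / (containing + 1))     # step 2
        scores[word] = tf * idf                              # step 3
    # Sort by score, highest first, and keep the first few words.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy corpus of ten documents: "china" is common, "bee" is rare.
corpus = (
    [{"china", "bee", "farming"}]
    + [{"china", "farming"}]
    + [{"china", "economy"}] * 4
    + [{"weather", "sports"}] * 4
)
doc = ["china", "bee", "farming"] * 5  # each word appears equally often

for word, score in extract_keywords(doc, corpus):
    print(word, round(score, 4))
```

Although the three words have identical term frequencies in `doc`, the ranking comes out as "bee", "farming", "china", because the IDF term rewards the rarer words.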

Take "Bee Farming in China" as an example. Suppose the article is 1,000 words long and "China", "bee", and "farming" each appear 20 times, so each has a term frequency (TF) of 0.02. A Google search then finds 25 billion web pages containing the word "的" ("of"), which we take as the total number of Chinese web pages. Of these, 6.23 billion pages contain "China", 48.4 million contain "bee", and 97.3 million contain "farming". Their inverse document frequency (IDF) and TF-IDF values are as follows:

    Word       Pages containing the word   IDF     TF-IDF
    China      6.23 billion                0.603   0.0121
    bee        48.4 million                2.713   0.0543
    farming    97.3 million                2.410   0.0482

As the table above shows, "bee" has the highest TF-IDF value, "farming" comes second, and "China" is lowest. (If we also computed the TF-IDF of "的" ("of"), it would be extremely close to 0.) So if only one word is to be chosen, "bee" is the keyword of this article.
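The arithmetic in this example can be checked with a short script. It uses the page counts quoted above and assumes a base-10 logarithm, which the text does not state explicitly; the variable names are illustrative.

```python
import math

TOTAL_PAGES = 25_000_000_000  # assumed total number of Chinese web pages
TF = 20 / 1000                # each word appears 20 times in 1,000 words

pages_containing = {
    "China": 6_230_000_000,
    "bee": 48_400_000,
    "farming": 97_300_000,
}

results = {}
for word, pages in pages_containing.items():
    idf = math.log10(TOTAL_PAGES / (pages + 1))
    results[word] = (round(idf, 3), round(TF * idf, 4))
    print(f"{word:8s} IDF={idf:.3f} TF-IDF={TF * idf:.4f}")
```

Under these assumptions the IDF values come out at roughly 0.603, 2.713, and 2.410, and the TF-IDF values at roughly 0.0121, 0.0543, and 0.0482. With counts this large, the 1 added to the denominator has no visible effect.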

Besides automatic keyword extraction, the TF-IDF algorithm has many other uses. In information retrieval, for example, you can compute the TF-IDF of each search term ("China", "bee", "farming") for every document and add the values up to get a TF-IDF score for the whole document. The document with the highest score is the one most relevant to the search terms.
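The retrieval idea can be sketched as follows, assuming each document is a list of words, TF is normalized by document length, and IDF uses a base-10 logarithm over the same document collection. `rank_documents` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def rank_documents(query_terms, documents):
    """Rank documents by the sum of the TF-IDF values of the query terms.

    Returns (document index, score) pairs, best match first.
    """
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]
    scores = []
    for idx, doc in enumerate(documents):
        counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = counts[term] / len(doc) if doc else 0.0
            containing = sum(1 for s in doc_sets if term in s)
            idf = math.log10(n_docs / (containing + 1))
            score += tf * idf
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

documents = [
    ["china", "bee", "farming", "bee"],
    ["china", "economy", "growth", "trade"],
    ["weather", "sports", "news", "today"],
    ["music", "film", "art", "books"],
    ["travel", "food", "hotel", "china"],
]
ranking = rank_documents(["china", "bee", "farming"], documents)
print(ranking[0])  # the document about bee farming scores highest
```

Documents that mention only "china" still receive a small score, while documents containing none of the search terms score 0.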

The advantage of the TF-IDF algorithm is that it is simple and fast, and its results match reality fairly well. Its drawback is that measuring a word's importance by term frequency alone is not comprehensive enough: sometimes an important word does not appear very often. In addition, the algorithm ignores where a word appears. A word near the beginning of the article is treated as no more important than one near the end, which is not right. (One remedy is to give greater weight to the first paragraph of the article and the first sentence of each paragraph.)

Next time, I will combine TF-IDF with cosine similarity to measure how similar documents are to one another.
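One way to picture the position-weighting remedy is a term frequency in which words from the first paragraph count more than once. The sketch below is a simplification that boosts only the first paragraph, not the first sentence of every paragraph, and the weight of 2.0 is an arbitrary assumption; `weighted_tf` is a made-up name.

```python
from collections import defaultdict

def weighted_tf(paragraphs, first_paragraph_weight=2.0):
    """Term frequency in which first-paragraph words count extra.

    paragraphs: list of paragraphs, each a list of words.
    first_paragraph_weight: hypothetical boost; 2.0 is an arbitrary choice.
    """
    weights = defaultdict(float)
    total = 0.0
    for index, words in enumerate(paragraphs):
        weight = first_paragraph_weight if index == 0 else 1.0
        for word in words:
            weights[word] += weight
            total += weight
    if total == 0:
        return {}
    # Normalize by the total weight so the values still sum to 1.
    return {w: v / total for w, v in weights.items()}

paragraphs = [
    ["bee", "farming"],   # first paragraph: each word counts double
    ["china", "bee"],
    ["china", "honey"],
]
tf = weighted_tf(paragraphs)
print(tf["bee"], tf["china"])
```

In this toy input "bee" and "china" both appear twice, but "bee" ends up with the higher weighted frequency because one of its occurrences is in the first paragraph.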
