Applications of TF-IDF and Cosine Similarity (1): Automatic Keyword Extraction


This title looks complicated, but the question I want to discuss is actually very simple.

Given a very long article, how can I have a computer extract its keywords (automatic keyphrase extraction) correctly, with no manual intervention at all?

This problem touches on data mining, text processing, information retrieval, and many other cutting-edge areas of computing. Surprisingly, though, a very simple classic algorithm gives quite satisfactory results. It requires no advanced mathematics, and an ordinary reader can understand it in about 10 minutes. It is the TF-IDF algorithm, which I want to introduce today.

Let's start with an example. Suppose we have a long article, "Bee Culture in China," and we want a computer to extract its keywords.

An obvious idea is to find the words that appear most often. If a word is important, it should appear many times in the article. So we compute "term frequency" (TF) statistics.
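As a minimal sketch of this counting step, the snippet below computes term frequency for a short sample sentence. The function name `term_frequency`, the simple regex tokenizer, and the choice to divide each count by the total number of words are illustrative assumptions, not something the article prescribes.

```python
import re
from collections import Counter

def term_frequency(text):
    """Return {word: count / total_words} for a piece of text.

    Assumption: words are runs of letters or apostrophes, lowercased,
    and counts are normalized by the total number of words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}

# Made-up sample text, for illustration only.
sample = "the bee is in the hive and the bee makes honey"
tf = term_frequency(sample)
most_frequent = max(tf, key=tf.get)
print(most_frequent, tf[most_frequent])
```

In this sample the most frequent word is a function word rather than a meaningful keyword, which is the problem the next paragraph describes.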

As you have probably guessed, the words that appear most often are the most commonly used ones, such as "the", "is", and "in". These are called "stop words": words that do not help in finding the result and must be filtered out.
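Stop-word filtering can be sketched as follows. The `STOP_WORDS` set here is a deliberately tiny, hypothetical list chosen for illustration; a real system would use a much larger list suited to its language.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "is", "in", "of", "a", "and", "to"}

def count_content_words(text, stop_words=STOP_WORDS):
    """Count the words in text, skipping anything in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

# Made-up sample text, for illustration only.
counts = count_content_words("The bee is in the hive and the bee makes honey")
print(counts.most_common(3))
```

After filtering, only content words remain in the counts, so the frequency ranking is no longer dominated by function words.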

Suppose we filter them out and consider only the remaining meaningful words. We then run into another problem: we may find that "China", "bee", and "culture" all appear the same number of times. Does this mean they are equally important as keywords?

Obviously not. "China" is a very common word, while "bee" and "culture" are less common. If the three words appear equally often in an article, there is reason to think that "bee" and "culture" are more important than "China"; that is, in the keyword ranking, "bee" and "culture" should come before "China".

So we need an importance adjustment factor that measures whether a word is common. If a word is rare in general but appears many times in this article, it probably reflects what the article is about, and is exactly the kind of keyword we need.

In statistical terms, on top of term frequency we assign each word an "importance" weight. The most common words ("the", "is", "in") get the smallest weight, fairly common words ("China") get a small weight, and rarer words ("bee", "culture") get a larger weight. This weight is called "inverse document frequency" (IDF), and it becomes smaller the more common a word is.
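The passage above does not give a formula for this weight. One common variant, assumed here purely for illustration, takes the logarithm of the number of documents in a reference corpus divided by one plus the number of documents containing the word. The function name and the tiny corpus below are made up.

```python
import math

def inverse_document_frequency(word, corpus):
    """IDF of word over corpus, a list of documents given as sets of words.

    Assumed variant: log(total_docs / (docs_containing_word + 1)).
    The +1 avoids division by zero for words that appear in no document.
    """
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (containing + 1))

# Hypothetical four-document corpus; each document is a set of words.
corpus = [
    {"china", "bee", "culture"},
    {"china", "economy"},
    {"china", "history"},
    {"travel", "weather"},
]

idf_china = inverse_document_frequency("china", corpus)  # appears in 3 of 4 documents
idf_bee = inverse_document_frequency("bee", corpus)      # appears in 1 of 4 documents
print(idf_china, idf_bee)
```

Under this assumed formula the rarer word "bee" receives a larger weight than the common word "china", matching the ordering described above.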

Once you know the term frequency (TF) and the inverse document frequency (IDF), multiply the two values to get the word's TF-IDF value. The more important a word is to an article, the larger its TF-IDF value. So the few words with the highest TF-IDF values are the article's keywords.
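Putting the two pieces together, here is a minimal sketch of keyword ranking by TF-IDF. It assumes stop words have already been removed, that TF is the word count divided by the article length, and that IDF is log(total documents / (documents containing the word + 1)). These formulas, the function name `tf_idf_keywords`, and the toy corpus are illustrative assumptions rather than details given in the text.

```python
import math
from collections import Counter

def tf_idf_keywords(doc_words, corpus, top_n=3):
    """Rank the words of one document by TF-IDF, highest first.

    doc_words: list of words in the article (stop words already removed).
    corpus: list of documents, each a set of words, used to estimate IDF.
    """
    counts = Counter(doc_words)
    total = len(doc_words)
    scores = {}
    for word, n in counts.items():
        tf = n / total
        containing = sum(1 for other in corpus if word in other)
        idf = math.log(len(corpus) / (containing + 1))
        scores[word] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy article in which the three words appear equally often.
article = ["china", "bee", "culture", "china", "bee", "culture"]

# Hypothetical corpus: "china" is in 4 of 8 documents, "bee" and "culture" in 2 each.
corpus = [
    set(article),
    {"china", "economy"},
    {"china", "history"},
    {"china", "travel"},
    {"bee", "garden"},
    {"culture", "museum"},
    {"weather", "rain"},
    {"football", "league"},
]

keywords = tf_idf_keywords(article, corpus)
print(keywords)
```

Although all three words have the same term frequency in the toy article, "bee" and "culture" score above "china" because they appear in fewer documents of the corpus.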
