Natural Language Processing: Extracting Keywords with the TF-IDF Algorithm


The title may sound complicated, but the question I want to discuss is actually very simple.

Suppose there is a very long article, and I want a computer to extract its keywords (automatic keyphrase extraction) with no manual intervention at all. How can this be done correctly?

This problem touches on data mining, text processing, information retrieval, and many other active areas of computing. Surprisingly, though, a very simple classical algorithm gives quite satisfactory results. It requires no advanced mathematics, and an ordinary reader can understand it in about 10 minutes. This is the TF-IDF algorithm that I want to introduce today.

Let's start with an example. Suppose we have a long article, "Bee Culture in China," and we want a computer to extract its keywords.

An obvious idea is to find the words that appear most often. If a word is important, it should appear many times in the article. So we start by counting the "term frequency" (abbreviated as TF).

As you have probably guessed, the words that appear most often are the most commonly used ones, such as "的" ("of"), "是" ("is"), and "在" ("in"). These are called "stop words": words that do nothing to help find the result and must be filtered out.
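The counting and filtering step can be sketched in a few lines of Python. This is only an illustration: the tiny `STOP_WORDS` set, the `count_terms` helper, and the English sample sentence are all assumptions made for the example, not part of the original article.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "is", "in", "a", "and", "to"}

def count_terms(text, stop_words=STOP_WORDS):
    """Return a Counter of lowercase word counts, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

sample = "The bee is in the hive. The culture of the bee is old in China."

# Without filtering, a stop word dominates the ranking.
raw_counts = count_terms(sample, stop_words=set())
print(raw_counts.most_common(1))

# With filtering, a meaningful word comes out on top.
counts = count_terms(sample)
print(counts.most_common(1))
```

In this sample, "the" is the most frequent word before filtering and "bee" is the most frequent one afterwards.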

Suppose we filter them out and consider only the remaining meaningful words. We then run into another problem: we may find that "China", "bee", and "culture" all appear the same number of times. Does this mean that they are equally important as keywords?

Obviously not. "China" is a very common word, while "bee" and "culture" are less common. If the three words appear the same number of times in an article, there is reason to think that "bee" and "culture" are more important than "China"; that is, in the keyword ranking, "bee" and "culture" should come before "China".

So we need an importance adjustment factor that measures whether a word is common. If a word is rare in general but appears many times in this article, it probably reflects what the article is about and is exactly the keyword we need.

In statistical terms, on top of the term frequency we assign each word an "importance" weight. The most common words (such as "的", "是", and "在", roughly "of", "is", and "in") get the smallest weight, fairly common words ("China") get a small weight, and rarer words ("bee", "culture") get a larger weight. This weight is called the "inverse document frequency" (abbreviated as IDF), and its size is inversely related to how common a word is.

Once you know the term frequency (TF) and the inverse document frequency (IDF), multiply the two values to get the word's TF-IDF value. The more important a word is to an article, the larger its TF-IDF value. So the top-ranked words are the keywords of the article.

Here are the details of the algorithm.

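The first step is to calculate the term frequency.

    TF = number of times the word appears in the article

Since articles differ in length, the term frequency is normalized so that different articles can be compared.

    TF = (number of times the word appears in the article) / (total number of words in the article)

Or

    TF = (number of times the word appears in the article) / (number of times the most frequent word appears in the article)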
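Two common ways to normalize a raw count are to divide by the article's length or by the count of its most frequent word. The sketch below shows both; the function names `tf_by_length` and `tf_by_max` and the six-word sample are hypothetical, chosen only for illustration.

```python
from collections import Counter

def tf_by_length(term, words):
    """Occurrences of `term` divided by the total number of words."""
    return words.count(term) / len(words)

def tf_by_max(term, words):
    """Occurrences of `term` divided by the count of the most frequent word."""
    counts = Counter(words)
    return counts[term] / max(counts.values())

# Toy article, already split into words (assumed non-empty).
words = "bee culture bee china bee culture".split()

print(tf_by_length("bee", words))   # 3 of 6 words
print(tf_by_max("culture", words))  # 2 occurrences vs. 3 for "bee"
```

Either variant keeps a long article from outranking a short one merely because it contains more words.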

The second step is to calculate the inverse document frequency.

For this, a corpus is needed to model the environment in which the language is used.

    IDF = log( (total number of documents in the corpus) / (number of documents containing the word + 1) )

The more common the word, the larger the denominator and the smaller the inverse document frequency, which approaches 0. The 1 is added to the denominator so that it is never 0 (which would happen if no document contained the word). log means taking the logarithm of the resulting value.
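As a rough sketch of this step, the function below computes the inverse document frequency over a toy corpus. The corpus contents and the `idf` helper are made up for the example, and the base-10 logarithm is an assumption, since the text does not say which base is meant.

```python
import math

def idf(term, corpus):
    """log10(total documents / (documents containing `term` + 1))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / (containing + 1))

# Hypothetical corpus: each document is reduced to its set of distinct words.
corpus = [
    {"china", "bee", "culture"},
    {"china", "economy"},
    {"china", "history"},
    {"china", "travel"},
    {"china", "food"},
    {"china", "culture"},
    {"china", "tea"},
    {"weather", "rain"},
]

for term in ("china", "culture", "bee", "unicorn"):
    print(term, round(idf(term, corpus), 3))
```

The common word "china" gets the smallest weight, the rarer "bee" gets a larger one, and a word found in no document ("unicorn") still yields a finite value thanks to the added 1.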

The third step is to calculate TF-IDF.

    TF-IDF = TF × IDF

As you can see, TF-IDF is proportional to the number of times a word appears in the document and inversely related to how often the word appears across the whole language. The algorithm for automatic keyword extraction is therefore clear: calculate the TF-IDF value of every word in the document, sort in descending order, and take the top few words.
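Putting the three steps together, a minimal keyword extractor might look like the sketch below. It assumes length-normalized term frequency, a base-10 logarithm, and a toy corpus of word sets; the name `extract_keywords` and all the data are hypothetical.

```python
import math
from collections import Counter

def extract_keywords(doc_words, corpus, top_k=3):
    """Rank the words of one document by TF-IDF and return the top_k pairs."""
    counts = Counter(doc_words)
    total = len(doc_words)
    scores = {}
    for term, count in counts.items():
        tf = count / total  # step 1: normalized term frequency
        containing = sum(1 for other in corpus if term in other)
        idf = math.log10(len(corpus) / (containing + 1))  # step 2
        scores[term] = tf * idf  # step 3
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Toy document in which all three words appear equally often.
doc = ["china", "bee", "culture", "china", "bee", "culture"]

# Hypothetical corpus of word sets; the first entry is the document itself.
corpus = [
    {"china", "bee", "culture"},
    {"china", "economy"},
    {"china", "history"},
    {"china", "travel"},
    {"china", "food"},
    {"china", "culture"},
    {"china", "tea"},
    {"weather", "rain"},
]

keywords = extract_keywords(doc, corpus)
print([term for term, _ in keywords])
```

Although the three words have identical term frequencies, the IDF factor ranks "bee" first, "culture" second, and "china" last.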

Take "Bee Culture in China" as an example. Suppose the article is 1000 words long and "China", "bee", and "culture" each appear 20 times, so the term frequency (TF) of each of the three words is 0.02. A Google search then finds 25 billion pages containing the word "的" ("of"); assume this is the total number of Chinese web pages. There are 6.23 billion pages containing "China", 48.4 million pages containing "bee", and 97.3 million pages containing "culture". Their inverse document frequency (IDF) and TF-IDF, computed with base-10 logarithms, are as follows:

    Word       Pages containing the word    IDF      TF-IDF
    China      6.23 billion                 0.603    0.0121
    bee        48.4 million                 2.713    0.0543
    culture    97.3 million                 2.410    0.0482
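The IDF and TF-IDF values for this example can be reproduced with a short script. It uses only the page counts quoted in the text; the base-10 logarithm is an assumption, and the variable names are illustrative.

```python
import math

TOTAL_PAGES = 25_000_000_000  # assumed total number of Chinese web pages

pages_containing = {
    "China": 6_230_000_000,
    "bee": 48_400_000,
    "culture": 97_300_000,
}

tf = 20 / 1000  # each word appears 20 times in a 1000-word article

results = {}
for word, pages in pages_containing.items():
    idf = math.log10(TOTAL_PAGES / (pages + 1))  # base 10 is an assumption
    results[word] = (idf, tf * idf)

for word, (idf, tfidf) in results.items():
    print(f"{word:8s} IDF={idf:.3f}  TF-IDF={tfidf:.4f}")
```

Under these assumptions the IDF values come out at roughly 0.603, 2.713, and 2.410, and the TF-IDF values at roughly 0.0121, 0.0543, and 0.0482.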

From the table above, "bee" has the highest TF-IDF value, "culture" comes second, and "China" is lowest. (If you also calculated the TF-IDF of the word "的", it would be extremely close to 0.) So if you could choose only one word, "bee" would be the keyword for this article.

Besides automatic keyword extraction, the TF-IDF algorithm is useful in many other places. In information retrieval, for example, you can calculate the TF-IDF of each search term ("China", "bee", "culture") for every document and add them up to get the TF-IDF of the whole document. The document with the highest value is the one most relevant to the search terms.
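This retrieval idea can be sketched as follows: sum the TF-IDF of each search term per document and sort by the total. The `rank_documents` function and the four toy documents are hypothetical, documents are assumed to be non-empty word lists, and the base-10 logarithm is again an assumption.

```python
import math
from collections import Counter

def rank_documents(query_terms, documents):
    """Return document names sorted by summed TF-IDF of the query terms."""
    n_docs = len(documents)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    scores = {}
    for name, words in documents.items():
        counts = Counter(words)
        score = 0.0
        for term in query_terms:
            tf = counts[term] / len(words)
            idf = math.log10(n_docs / (doc_freq[term] + 1))
            score += tf * idf
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical mini-collection: name -> list of words.
documents = {
    "apiculture": "china bee culture bee hive honey".split(),
    "economy": "china economy trade growth market china".split(),
    "weather": "rain wind cloud sun storm snow".split(),
    "travel": "china travel city food culture hotel".split(),
}

ranking = rank_documents(["china", "bee", "culture"], documents)
print(ranking)
```

Here the "apiculture" document ranks first because it contains the rare term "bee" twice, while the frequent term "china" contributes nothing to any score.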

The advantages of the TF-IDF algorithm are that it is simple and fast and that its results match reality fairly well. The disadvantage is that measuring a word's importance by term frequency alone is not comprehensive enough: sometimes an important word does not appear many times. Moreover, the algorithm cannot reflect where a word appears: words near the beginning and words near the end are treated as equally important, which is not correct. (One solution is to give greater weight to the first paragraph of the text and to the first sentence of each paragraph.)
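One way the position-weighting idea might be sketched is shown below. It assumes that the boosted positions are the first paragraph and the first sentence of every paragraph, and that boosted words simply count `boost` times; the function name, the boost factor of 2.0, and the naive sentence splitter are all hypothetical choices, not something the article specifies.

```python
from collections import Counter
import re

def position_weighted_counts(paragraphs, boost=2.0):
    """Count words, weighting words in boosted positions by `boost`."""
    weights = Counter()
    for p_index, paragraph in enumerate(paragraphs):
        # Naive sentence split on ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        for s_index, sentence in enumerate(sentences):
            # Boost the whole first paragraph and each paragraph's first sentence.
            boosted = p_index == 0 or s_index == 0
            weight = boost if boosted else 1.0
            for word in re.findall(r"[a-z]+", sentence.lower()):
                weights[word] += weight
    return weights

paragraphs = [
    "Bees matter. Honey is sweet.",
    "Hives are busy. Honey sells well.",
]

weights = position_weighted_counts(paragraphs)
print(weights["honey"], weights["bees"], weights["sells"])
```

The weighted counts could then replace the raw counts in the term-frequency step, so that words in prominent positions score higher.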


Original link: http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
