Extracting article keywords with the TF-IDF algorithm

Source: Internet
Author: User
Tags: keywords, idf

I. Introduction to TF-IDF

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to an article. That importance depends first on how often the word appears in the article: the more occurrences, the more important the word is to that article. It also depends on how many articles in the corpus contain the word: the more articles it appears in, the less it distinguishes this article, so its importance decreases. The specific algorithm is described below.

II. Implementing the algorithm

1. Before implementing the algorithm, the article must be segmented into words. For Chinese word segmentation, a recommended Python library is jieba. Its author has published the project on GitHub as open source: https://github.com/fxsjy/jieba

2. Calculating term frequency (TF)

Word frequency (TF) = number of occurrences of a term in an article

Because articles differ in length, the word frequency needs to be normalized:

Word frequency (TF) = number of occurrences of a word in an article / total number of words in the article

or

Word frequency (TF) = number of occurrences of a word in an article / number of occurrences of the most frequent word in the article
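The two normalizations above can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequencies`, its `normalize` parameter, and the sample tokens are assumptions made for this example, and the input is expected to be already segmented into a list of words.

```python
from collections import Counter

def term_frequencies(tokens, normalize="length"):
    """Map each term to its normalized frequency.

    normalize="length": count / total number of tokens
    normalize="max":    count / count of the most frequent term
    (both names are hypothetical, chosen for this sketch)
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    if normalize == "length":
        denom = len(tokens)
    elif normalize == "max":
        denom = max(counts.values())
    else:
        raise ValueError("normalize must be 'length' or 'max'")
    return {term: n / denom for term, n in counts.items()}

# Made-up sample: "cow" appears 3 times out of 6 tokens.
tokens = ["cow", "milk", "cow", "farm", "cow", "milk"]
tf_len = term_frequencies(tokens, "length")
tf_max = term_frequencies(tokens, "max")
print(tf_len["cow"], tf_max["cow"], tf_max["milk"])
```

With the second normalization the most frequent word always gets a value of 1, so the scores of different articles stay on a comparable scale regardless of length.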

3. Calculation of IDF

Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1))

The +1 in the denominator avoids division by zero when no document in the corpus contains the word. A corpus can be downloaded from the Internet. The purpose of the inverse document frequency is to down-weight words that appear frequently everywhere, such as "the", "we", and "he". These words carry little meaning for any particular document but occur very often, so they could distort the final result; a word that appears in most documents cannot serve as a keyword for our article.

4. Calculating the TF-IDF value

TF-IDF = Word frequency (TF) * Inverse document frequency (IDF)
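As a rough illustration of the IDF formula, the sketch below counts how many documents contain each word and applies log(N / (df + 1)). The function name, the toy corpus, and the choice of the natural logarithm are all assumptions; the text does not specify the log base.

```python
import math

def inverse_document_frequencies(documents):
    """documents: list of token lists. Returns a dict term -> IDF.

    Uses IDF = log(N / (df + 1)); natural log is an assumption here.
    """
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() so that a word is counted once per document
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / (df + 1)) for term, df in doc_freq.items()}

# Toy corpus, made up for illustration.
corpus = [
    ["the", "cow", "eats", "grass"],
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "bird", "sings"],
]
idf = inverse_document_frequencies(corpus)
print(idf["the"], idf["cow"])
```

In this toy corpus "the" occurs in every document, so with the +1 in the denominator its IDF is slightly negative, while "cow", which occurs in only one document, gets a clearly higher value.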

5. Sorting

Sort the words by their TF-IDF values in descending order and extract those with the largest values as keywords.
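Steps 2 through 5 can be put together in one short sketch. It is not how jieba implements keyword extraction; `top_keywords`, the toy corpus, and the natural logarithm are assumptions made for this example, and TF is normalized by article length.

```python
import math
from collections import Counter

def top_keywords(doc_tokens, corpus, k=3):
    """Return the k (term, score) pairs with the highest TF-IDF.

    doc_tokens: token list of the article to analyse
    corpus:     list of token lists (hypothetical toy corpus)
    """
    n_docs = len(corpus)
    doc_sets = [set(tokens) for tokens in corpus]
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, n in counts.items():
        df = sum(1 for s in doc_sets if term in s)  # documents containing term
        tf = n / total                              # step 2
        idf = math.log(n_docs / (df + 1))           # step 3
        scores[term] = tf * idf                     # step 4
    # step 5: sort by score, largest first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

corpus = [
    ["the", "cow", "gives", "milk", "the", "cow", "eats"],
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "bird", "sings"],
]
ranked = top_keywords(corpus[0], corpus, k=2)
print(ranked)
```

Here "the" and "cow" both occur twice in the first article, but only "cow" ranks first, because "the" appears in every document of the corpus.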

6. Summary

The TF-IDF algorithm is simple and fast, and its results generally match expectations. However, it measures the importance of a term by "word frequency" alone, which is not comprehensive enough: the highest-scoring words do not necessarily reflect the main idea or theme of the article. Moreover, the algorithm ignores the fact that a word's position in the article affects its importance. One way to address this is to assign different weights to words depending on where they appear.

III. Test case

The following example uses the jieba library to extract keywords with TF-IDF. The content of the test article is as follows:
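One possible way to give positions different weights is to count a word that occurs in the title more heavily than one that occurs in the body. The sketch below is purely hypothetical: the function name, the title/body split, and the weight of 3.0 are assumptions, not values from the article.

```python
from collections import Counter

def weighted_term_frequencies(title_tokens, body_tokens, title_weight=3.0):
    """Term frequencies in which each title occurrence counts title_weight times.

    title_weight=3.0 is an arbitrary illustrative value.
    """
    weighted = Counter()
    for term in body_tokens:
        weighted[term] += 1.0
    for term in title_tokens:
        weighted[term] += title_weight
    total = sum(weighted.values())
    if total == 0:
        return {}
    return {term: w / total for term, w in weighted.items()}

# Made-up sample article.
title = ["tfidf", "keywords"]
body = ["tfidf", "ranks", "words", "by", "frequency", "frequency"]
wtf = weighted_term_frequencies(title, body)
print(wtf["keywords"], wtf["frequency"])
```

In this sample "keywords" appears only once, in the title, yet it ends up with a higher weighted frequency than "frequency", which appears twice in the body. The weighted values can then be multiplied by IDF in the usual way.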

There are many different mathematical formulas that can be used to calculate TF-IDF. The
example here is calculated using the above mathematical formula.
Word frequency (TF) is the number of occurrences of a term divided by the total number of words in the file.
If the total number of words in a document is 100, and the word "cow" appears 3 times,
then the word frequency of "cow" in the document is 3/100 = 0.03.
One way to calculate the file frequency (DF) is to count how many files contain the word "cow"
and then divide by the total number of files in the file set. Therefore, if the word "cow" appears in 1,000 documents, and the
total number of documents is 10,000,000, the inverse file frequency is log(10,000,000 / 1,000) = 4. The
final TF-IDF score is 0.03 * 4 = 0.12.
Python code

import sys
sys.path.append('../')

import jieba
import jieba.analyse
from optparse import OptionParser

file_name = "./txt/test.txt"

# Read the test article as bytes
content = open(file_name, 'rb').read()

# topK=10 returns the 10 keywords with the highest TF-IDF values
tags = jieba.analyse.extract_tags(content, topK=10)

print(",".join(tags))
Output results

000, file, cow, words, TF, Word frequency, 100,idf,10,0.03
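The arithmetic in the test article can be checked directly. This is a minimal sketch, assuming the logarithm in that example is base 10, since that is what makes log(10,000,000 / 1,000) equal 4. Note that the example in the article does not add 1 to the denominator, unlike the IDF formula in section 3.

```python
import math

# Numbers taken from the test article above.
tf = 3 / 100                          # "cow" appears 3 times in 100 words
idf = math.log10(10_000_000 / 1_000)  # base-10 log is an assumption
score = tf * idf

print(tf, idf, round(score, 2))
```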




