I. Introduction to TF-IDF
TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and text mining. It is a statistical method for evaluating how important a word is to an article. The importance of a word to an article rises with the number of times it appears in that article: the more often a word occurs in the article, the more important it is to it. At the same time, importance falls with the number of articles in the corpus that contain the word: the more documents it appears in, the less it tells us about any one of them. The specific algorithm is described below.
II. Implementation of the algorithm
1. Before implementing the algorithm, we need to segment the article into words. For Chinese word segmentation, I recommend a Python library: Jieba. Its author has published the project as open source on GitHub, at https://github.com/fxsjy/jieba
2. Calculating term frequency (TF)
Term frequency (TF) = number of occurrences of a term in an article
Since articles differ in length, we need to normalize the raw count:
Term frequency (TF) = occurrences of the word in the article / total number of words in the article, or alternatively: Term frequency (TF) = occurrences of the word in the article / occurrences of the most frequent word in the article
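Both normalizations above can be sketched in a few lines of Python (a minimal illustration; the sample sentence and function names are mine, not from the article):

```python
from collections import Counter

def term_frequencies(tokens):
    """Normalize each count by the total number of tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def term_frequencies_max(tokens):
    """Alternative: normalize by the count of the most frequent token."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tokens = "the cow eats grass and the cow sleeps".split()
tf = term_frequencies(tokens)          # "cow": 2 / 8 = 0.25
tf_max = term_frequencies_max(tokens)  # "cow": 2 / 2 = 1.0
```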
3. Calculating inverse document frequency (IDF)
Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1)). A corpus can be downloaded from the Internet. The reason for computing the inverse document frequency is to discount words that appear everywhere, such as "the", "we", and "he". These words are not very important to any single document, yet they occur very frequently and could distort our final result; a word that appears in most documents cannot serve as a keyword for our article.
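The IDF formula above can be written directly in Python (the toy corpus is illustrative; note that with the +1 smoothing in the denominator, a word appearing in every document gets a negative score, which is exactly the discounting effect described):

```python
import math

def inverse_document_frequency(term, documents):
    """IDF = log(total documents / (documents containing the term + 1)).
    The +1 avoids division by zero when the term appears in no document."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / (containing + 1))

docs = [
    {"the", "cow", "eats"},
    {"the", "dog", "runs"},
    {"the", "cat", "sleeps"},
]
# "the" appears in all 3 documents, "cow" in only 1,
# so "the" receives a much lower IDF than "cow".
idf_the = inverse_document_frequency("the", docs)
idf_cow = inverse_document_frequency("cow", docs)
```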
4. Calculating the TF-IDF value
TF-IDF = term frequency (TF) * inverse document frequency (IDF)
5. Sorting
By sorting the words by their TF-IDF values, we can extract the words with the largest TF-IDF values as the article's keywords.
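Steps 2 through 5 can be combined into one small sketch (the function name and toy corpus are illustrative, not the article's code):

```python
import math
from collections import Counter

def tf_idf_keywords(doc_tokens, corpus, top_k=3):
    """Score every term in doc_tokens by TF * IDF against the corpus
    and return the top_k terms sorted by descending score."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / total                          # normalized term frequency
        df = sum(1 for d in corpus if term in d)    # document frequency
        idf = math.log(n_docs / (df + 1))           # smoothed IDF
        scores[term] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

corpus = [
    ["the", "cow", "eats", "grass"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "sun", "is", "bright"],
]
# "the" occurs in every document, so it is pushed out of the top keywords.
keywords = tf_idf_keywords(corpus[0], corpus, top_k=2)
```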
6. Summary
The advantages of the TF-IDF algorithm are that it is simple and fast, and its results match intuition reasonably well. However, TF-IDF measures a term's importance purely by word frequency, which is not comprehensive enough: the most frequent words do not necessarily reflect the main idea of the article or highlight its theme. Moreover, the algorithm ignores the fact that words in different positions carry different importance; to address this, we can give words different weights depending on where they appear in the article.
III. Test case
The following example uses the Jieba library to implement the TF-IDF algorithm. Here is the content of the test article:
There are many different mathematical formulas that can be used to calculate TF-IDF. The example here is calculated using the formulas above. Term frequency (TF) is the number of occurrences of a term divided by the total number of words in the document. If the total number of words in a document is 100, and the word "cow" appears 3 times, then the TF of "cow" in the document is 3/100 = 0.03. One way to calculate document frequency (DF) is to count how many documents the word "cow" appears in, and then divide by the total number of documents in the collection. So, if "cow" appears in 1,000 documents and the total number of documents is 10,000,000, the inverse document frequency is log(10,000,000/1,000) = 4. The final TF-IDF score is 0.03 * 4 = 0.12.
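The arithmetic in this passage can be checked directly (note that the example uses a base-10 logarithm):

```python
import math

tf = 3 / 100                            # "cow" appears 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)    # base-10 log, as in the worked example: 4.0
tf_idf = tf * idf                       # 0.03 * 4 = 0.12
```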
Python code
import sys
sys.path.append('../')
import jieba
import jieba.analyse

file_name = "./txt/test.txt"
content = open(file_name, 'rb').read()
# topK=10 means output the top 10 keywords
tags = jieba.analyse.extract_tags(content, topK=10)
print(",".join(tags))
Output results
000, file, cow, words, TF, word frequency, 100, idf, 10, 0.03