TF-IDF algorithm--Principle and implementation

Last Update:2017-11-12 Source: Internet

Author: User

Tags idf


TF-IDF is a commonly used weighting technique in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.

TF-IDF is a traditional statistical measure of how important a word is to one document within a document set. A word's weight increases with its frequency in the current document and decreases with the number of documents in the set that contain it.

First, the TF (term frequency) calculation. TF is the frequency of a word within the current document: $\mathrm{TF}(w, d) = \frac{n_{w,d}}{\sum_{k} n_{k,d}}$. In this formula, the numerator $n_{w,d}$ is the number of times the word $w$ occurs in document $d$, and the denominator is the total number of occurrences of all keywords in that document.
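As a minimal sketch of the TF calculation described above, the following function counts each word in one already tokenized document and divides by the total number of words. The function name `term_frequencies` and the sample words are illustrative assumptions, not part of the original article.

```python
from collections import Counter


def term_frequencies(words):
    """Return {word: occurrences / total occurrences} for one tokenized document."""
    counts = Counter(words)
    total = sum(counts.values())  # total occurrences of all keywords in the document
    return {word: count / total for word, count in counts.items()}


# "cat" occurs 2 times out of 4 words
tf = term_frequencies(["cat", "sat", "cat", "mat"])
print(tf["cat"])  # 0.5
```

An empty document simply yields an empty dictionary, so no division by zero occurs.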

Next, the IDF (inverse document frequency) calculation. IDF measures how common a word is across the whole document set: $\mathrm{IDF}(w) = \log \frac{|D|}{|\{d \in D : w \in d\}|}$. Inside the logarithm, the numerator is the number of documents in the document set and the denominator is the number of documents that contain the current keyword; the logarithm of this ratio is the IDF value of the word. The TF-IDF value of a word in a document is the product of its TF and its IDF.
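The IDF calculation can be sketched in the same way. This example assumes the documents are already tokenized into lists of words and uses the natural logarithm (`math.log`); the function name `inverse_document_frequencies` and the sample data are hypothetical.

```python
import math


def inverse_document_frequencies(documents):
    """Return {word: log(N / number of documents containing word)}."""
    doc_count = len(documents)
    containing = {}  # word -> number of documents that contain it
    for words in documents:
        for word in set(words):  # count each document at most once per word
            containing[word] = containing.get(word, 0) + 1
    return {word: math.log(doc_count / n) for word, n in containing.items()}


docs = [["cat", "sat"], ["cat", "ran"], ["cat", "dog"]]
idf = inverse_document_frequencies(docs)
print(idf["cat"])  # 0.0, because "cat" appears in every document
print(idf["dog"] > idf["cat"])  # True, rarer words get a higher IDF
```

A word that occurs in every document gets an IDF of zero, which is why very common words end up with a TF-IDF of zero regardless of how often they occur.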

Below, let me introduce the design and implementation of the TF-IDF algorithm through Python:

Object 1: Article set (attributes: a collection of article objects; a dictionary mapping each keyword to the number of articles that contain it)

Object 2: Article (attributes: the total number of keyword occurrences in the article; a dictionary mapping each keyword to its article-keyword object)

Object 3: Article keyword (attributes: the keyword itself; the number of occurrences of the keyword in the current article; tf_idf)
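One possible way to express these three objects is with dataclasses. This is only a sketch: the class and field names below are illustrative assumptions and differ from the names used in the article's own code.

```python
from dataclasses import dataclass, field


@dataclass
class ArticleKeyword:
    word: str             # the keyword itself
    count: int = 1        # occurrences of the keyword in the current article
    tf_idf: float = 0.0   # filled in once all articles have been read


@dataclass
class Article:
    doc_id: str
    total_keyword_count: int = 0                   # occurrences of all keywords
    keywords: dict = field(default_factory=dict)   # word -> ArticleKeyword


@dataclass
class ArticleSet:
    articles: list = field(default_factory=list)         # all Article objects
    article_counts: dict = field(default_factory=dict)   # word -> number of articles containing it


first = Article("1")
first.keywords["cat"] = ArticleKeyword("cat")
second = Article("2")
print(second.keywords)  # {} because each article has its own dictionary
```

Using `default_factory` gives every instance its own list or dictionary. Declaring a mutable container directly as a class attribute would share it between all instances.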

Implementation process:

1. Create the article object and initialize its keyword dictionary.

2. Traverse the keywords of the article. For each keyword:

2.1 Add one to the article's total number of keyword occurrences.

2.2 Check whether the keyword already exists in the article's keyword dictionary. If it does, find its article-keyword object and add one to its count; if it does not, create an article-keyword object and add it to the article's keyword dictionary.

2.3 If this is the first time the keyword appears in the current article, update the number of articles recorded for the keyword: if the keyword is already in the keyword-to-article-count dictionary, add one to its article count; otherwise add it to the dictionary with an initial value of 1.

2.4 When the traversal is finished, the article's keyword dictionary is complete; add the current article to the article set.

3. Traverse the article set, calculate the TF-IDF of each keyword, and output it.
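The steps above can be sketched compactly with plain dictionaries. This sketch assumes the articles are already tokenized (the article's own code uses jieba for that), and the function name `compute_tf_idf` and its input shape are hypothetical.

```python
import math


def compute_tf_idf(tokenized_articles):
    """tokenized_articles: {doc_id: [keyword, ...]} -> {doc_id: {keyword: tf_idf}}."""
    article_counts = {}  # keyword -> number of articles containing it
    articles = []        # (doc_id, total keyword occurrences, {keyword: count})
    for doc_id, words in tokenized_articles.items():
        total = 0
        counts = {}                          # step 1: new article, empty keyword dictionary
        for word in words:                   # step 2: traverse the keywords
            total += 1                       # step 2.1
            if word in counts:               # step 2.2
                counts[word] += 1
            else:
                counts[word] = 1
                # step 2.3: first occurrence in this article
                article_counts[word] = article_counts.get(word, 0) + 1
        articles.append((doc_id, total, counts))  # step 2.4

    # step 3: TF-IDF = TF * IDF for every keyword of every article
    n_articles = len(articles)
    return {
        doc_id: {
            word: (count / total) * math.log(n_articles / article_counts[word])
            for word, count in counts.items()
        }
        for doc_id, total, counts in articles
    }


scores = compute_tf_idf({"a": ["x", "y", "x"], "b": ["y", "z"]})
print(scores["a"]["y"])  # 0.0, because "y" occurs in both articles
```

Note that the IDF can only be computed after all articles have been read, which is why the scoring happens in a separate pass.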

Implementation code (the code reads a file in which each line simulates one document, in the form of a document ID, a tab, and the document text):

    # tf_idf.py
    # -*- coding: utf-8 -*-
    import math

    import jieba


    class DocumentSet():
        def __init__(self):
            self.documentList = []  # all Document objects
            self.key_Count = {}     # keyword -> number of articles that contain it


    class Document():
        def __init__(self, docid):
            self.docid = docid
            self.docKeySumCount = 0  # total occurrences of all keywords in the article
            self.docKeySet = {}      # keyword -> DocKey object


    class DocKey():
        def __init__(self, word):
            self.word = word
            self.docKeyCount = 1  # occurrences of this keyword in the current article
            self.tf_idf = 0       # TF-IDF value of this keyword


    docList = DocumentSet()
    with open("C:/users/zw/desktop/key-words.txt", 'r') as f:
        for line in f:
            # each line is "<document id>\t<document text>"
            datafile = line.rstrip('\n').split('\t')
            if len(datafile) >= 2:
                doc = Document(datafile[0])
                for i in jieba.cut(datafile[1]):
                    doc.docKeySumCount += 1
                    if i not in doc.docKeySet:
                        doc.docKeySet[i] = DocKey(i)
                        # first occurrence in this article:
                        # record one more article containing the keyword
                        docList.key_Count[i] = docList.key_Count.get(i, 0) + 1
                    else:
                        doc.docKeySet[i].docKeyCount += 1
                docList.documentList.append(doc)

    for d in docList.documentList:
        for k in d.docKeySet:
            # TF-IDF is the product of TF and IDF
            tf = d.docKeySet[k].docKeyCount / d.docKeySumCount
            idf = math.log(len(docList.documentList) / docList.key_Count[k])
            d.docKeySet[k].tf_idf = tf * idf
            print("Article ID: %s, keyword '%s', TF-IDF value: %s"
                  % (d.docid, k, d.docKeySet[k].tf_idf))
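The implementation reads each line of the input file, splits it on a tab, and skips lines with fewer than two fields. To try that parsing step without a file on disk, a small sketch like the following can be used; the helper name `read_documents` and the in-memory sample are assumptions made for illustration.

```python
import io


def read_documents(lines):
    """Yield (doc_id, text) pairs from tab-separated lines, skipping malformed ones."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[0], fields[1]


# io.StringIO stands in for an opened text file
sample = io.StringIO("1\tthe cat sat\nline without a tab\n2\tthe dog ran\n")
print(list(read_documents(sample)))
# [('1', 'the cat sat'), ('2', 'the dog ran')]
```

Because the helper accepts any iterable of lines, the same function works with a real file object opened via `open(...)`.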

This article is an English version of an article originally published in Chinese on aliyun.com.
