TF-IDF algorithm--Principle and implementation

Source: Internet
Author: User
Tags idf

TF-IDF is a commonly used weighting technique in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.

TF-IDF is a traditional statistical measure of how important a word is to one document within a document set. A word's weight increases in proportion to its frequency in the current document and decreases as the number of documents in the set that contain the word grows.

First, the TF (term frequency) calculation. TF is the frequency of a word within the current document:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

In this formula, the numerator $n_{i,j}$ is the number of times word $t_i$ occurs in document $d_j$, and the denominator is the sum of the occurrences of all keywords in that document.
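As a small illustration of the TF calculation, the sketch below counts the words of one already-tokenized document and divides each count by the total. The function name `term_frequencies` and the sample tokens are assumptions made for this example, not part of the original implementation.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return {word: count / total} for one tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())  # occurrences of all keywords in the document
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample document: "cat" occurs 2 times out of 4 tokens.
doc = ["cat", "sat", "cat", "mat"]
tf = term_frequencies(doc)
print(tf["cat"], tf["sat"])
```

An empty token list simply yields an empty dictionary, so no division by zero occurs.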

Next, the IDF (inverse document frequency) calculation. IDF measures how common a word is across the document set:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|}$$

Inside the logarithm, the numerator $|D|$ is the number of documents in the document set, and the denominator is the number of documents that contain the current keyword. The logarithm of this quotient is the IDF value of the term. The TF-IDF weight of a word in a document is the product of its TF and IDF values.
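The IDF calculation can be sketched in the same way. This is a minimal example under the assumption that documents are already tokenized; `inverse_document_frequencies` and the sample documents are hypothetical names and data chosen for illustration.

```python
import math


def inverse_document_frequencies(documents):
    """Return {word: log(N / df)}, where df is the number of documents containing the word."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for word in set(tokens):  # count each document at most once per word
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Hypothetical sample set of four tokenized documents.
docs = [["cat", "sat"], ["cat", "mat"], ["dog", "sat"], ["cat", "dog"]]
idf = inverse_document_frequencies(docs)
print(idf["cat"], idf["sat"], idf["mat"])
```

Here "cat" appears in three of the four documents and "mat" in only one, so "mat" receives the larger IDF value. A word that appears in every document gets an IDF of zero.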

Below is the design and implementation of the TF-IDF algorithm in Python:

Object 1: Article set (attributes: a collection of article objects; a dictionary mapping each keyword to the number of articles that contain it)

Object 2: Article (attributes: the total number of keyword occurrences in the article; a dictionary mapping each keyword to its article-keyword object)

Object 3: Article keyword (attributes: the keyword itself; the number of times it occurs in the current article; its TF-IDF value)
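One possible way to express these three objects is with dataclasses, as in the sketch below. The class and field names (`ArticleKeyword`, `Article`, `ArticleSet`, `article_id`, and so on) are assumptions made for this example. Each container is created per instance through `default_factory`, so two articles never share the same keyword dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class ArticleKeyword:
    word: str             # the keyword itself
    count: int = 1        # occurrences in the current article
    tf_idf: float = 0.0   # filled in once all articles are loaded


@dataclass
class Article:
    article_id: str       # assumed identifier for the article
    total_count: int = 0  # total keyword occurrences in the article
    keywords: dict = field(default_factory=dict)  # word -> ArticleKeyword


@dataclass
class ArticleSet:
    articles: list = field(default_factory=list)
    doc_counts: dict = field(default_factory=dict)  # word -> number of articles containing it


# Each Article gets its own keyword dictionary.
a, b = Article("1"), Article("2")
a.keywords["cat"] = ArticleKeyword("cat")
print(len(a.keywords), len(b.keywords))
```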

Implementation process:

1. Create the article object and initialize its keyword map.

2. Traverse the keywords. For each keyword:

2.1 Add one to the article's total number of keyword occurrences.

2.2 Check whether the keyword already exists in the article's keyword map. If it does, find it and add one to its count; if it does not, create an article-keyword object and insert it into the article's keyword map.

2.3 If this is the keyword's first occurrence in the article, record it in the keyword-to-article-count dictionary: if the keyword is already in the dictionary, add one to its article count; otherwise add it with an initial value of 1.

2.4 When the traversal is complete, the article's keyword map is fully loaded. Add the current article to the article set.

3. Traverse the article set, calculate the TF-IDF of each keyword, and output it.

Implementation code (the code reads a file to simulate multiple documents):

# tf_idf.py
# -*- coding: utf-8 -*-
import math

import jieba


class DocumentSet:
    def __init__(self):
        self.documentList = []
        self.key_Count = {}  # keyword -> number of articles that contain it


class Document:
    def __init__(self, docid):
        self.docid = docid
        self.docKeySumCount = 0  # total occurrences of all keywords in the article
        self.docKeySet = {}      # keyword -> DocKey object


class DocKey:
    def __init__(self, word):
        self.word = word
        self.docKeyCount = 1  # occurrences of this keyword in the current article
        self.tf_idf = 0       # TF-IDF value of this keyword


docList = DocumentSet()

# Each line of the input file has the form "<article ID>\t<text>".
with open("C:/users/zw/desktop/key-words.txt", 'r') as f:
    for line in f:
        datafile = line.rstrip('\n').split('\t')
        if len(datafile) >= 2:
            doc = Document(datafile[0])
            wordList = list(jieba.cut(datafile[1]))
            for i in wordList:
                doc.docKeySumCount += 1
                if i not in doc.docKeySet:
                    doc.docKeySet[i] = DocKey(i)
                    # First occurrence in this article:
                    # record one more article that contains the keyword.
                    docList.key_Count[i] = docList.key_Count.get(i, 0) + 1
                else:
                    doc.docKeySet[i].docKeyCount += 1
            docList.documentList.append(doc)

for d in docList.documentList:
    for k in d.docKeySet:
        # TF-IDF is the product of TF and IDF.
        tf = d.docKeySet[k].docKeyCount / d.docKeySumCount
        idf = math.log(len(docList.documentList) / docList.key_Count[k])
        d.docKeySet[k].tf_idf = tf * idf
        print("Article ID: %s, keyword '%s', TF-IDF value: %s"
              % (d.docid, k, d.docKeySet[k].tf_idf))
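To try the calculation without jieba or an input file, the following self-contained sketch runs the same steps on a few in-memory documents. It assumes whitespace tokenization, and the `tf_idf` function and the three-document `corpus` are hypothetical names and data used only for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: {doc_id: list of tokens}. Returns {doc_id: {word: score}}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # one count per document containing the word
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical sample corpus, tokenized on whitespace.
corpus = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat on the log".split(),
    "d3": "the cat chased the dog".split(),
}
result = tf_idf(corpus)
for doc_id, words in result.items():
    top = max(words, key=words.get)  # highest-scoring word in this document
    print(doc_id, top, round(words[top], 4))
```

Because "the" occurs in every document, its IDF is zero and its TF-IDF score is zero in all three documents, while words unique to one document score highest there.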

  
