TF–IDF algorithm explanation and Python implementation (Part 2)

Source: Internet
Author: User
Tags idf

TF–IDF Algorithm Python code implementation

This is the core of a TF-IDF implementation I wrote. It is not a complete implementation, but the remaining steps are simple. Since TF-IDF = TF * IDF, we can compute the TF and IDF values separately and multiply them. First we create a small corpus as an example. It contains only four sentences, and each sentence represents one document:

copus = ['I am learning computer', 'It is eating', 'my book is still there', 'not working today']

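Chinese text has to be segmented into words before the words can be counted. jieba is one of the more useful word-segmentation tools for Python, so it is the one used here; a link to jieba is given at the end of the article. (The corpus was originally written in Chinese; the sentences appear here in English translation.) First, segment each document: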

import jieba

copus = ['I am learning computer', 'It is eating', 'my book is still there', 'not working today']
# Segment every document into a list of words
copus = [[word for word in jieba.cut(doc)] for doc in copus]
print(copus)

Output Result:

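[['I', 'being', 'learning', 'computer'], ['It', 'being', 'eating'], ['I', '的', 'book', 'still', 'in', 'you', 'there'], ['today', 'no', 'work']]

(The tokens are word-by-word English glosses of the segmented Chinese. The second token of the third document is the possessive particle 的, which has no standalone English equivalent.)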

The documents are now in the format we want. Next comes the word-frequency count, which gives the TF values. Here the Counter class converts each document into a dictionary that maps each word to its frequency, so the raw-count TF values are already available:

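from collections import Counter

tf = []
for doc in copus:
    # Counter maps each word in the document to its number of occurrences
    tf.append(Counter(doc))
print(tf)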

Output Result:

[Counter({'I': 1, 'being': 1, 'learning': 1, 'computer': 1}), Counter({'It': 1, 'being': 1, 'eating': 1}), Counter({'的': 1, 'book': 1, 'you': 1, 'in': 1, 'there': 1, 'I': 1, 'still': 1}), Counter({'today': 1, 'no': 1, 'work': 1})]
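The counts above are raw frequencies. A common variant divides each count by the document length so that long and short documents are comparable. The following is a minimal sketch of that variant, not part of the original code; the `docs` list is a hypothetical pre-tokenized stand-in so the example runs without jieba, and `normalized_tf` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical pre-tokenized documents (stand-ins for the jieba output)
docs = [['I', 'being', 'learning', 'computer'],
        ['It', 'being', 'eating']]

def normalized_tf(tokens):
    """Return each word's count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

tf_norm = [normalized_tf(doc) for doc in docs]
print(tf_norm[0]['learning'])  # 0.25 (1 occurrence out of 4 tokens)
```

Because every word in this tiny corpus occurs once per document, the normalized values depend only on document length.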

Calculate IDF values

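import math
from collections import defaultdict

idf = defaultdict(int)
# Document frequency: in how many documents does each word appear?
for doc in tf:
    for word in doc:
        idf[word] += 1
# Convert document frequency to IDF; the +1 avoids division by zero.
# Note: len(idf) is the number of distinct words (15 here), not the number
# of documents; the usual IDF definition uses the document count, len(tf).
for word in idf:
    idf[word] = math.log(len(idf) / (idf[word] + 1))
print(idf)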

Output Result:

defaultdict(<class 'int'>, {'的': 2.0149030205422647, 'being': 1.6094379124341003, 'learning': 2.0149030205422647, 'computer': 2.0149030205422647, 'today': 2.0149030205422647, 'book': 2.0149030205422647, 'there': 2.0149030205422647, 'It': 2.0149030205422647, 'no': 2.0149030205422647, 'in': 2.0149030205422647, 'eating': 2.0149030205422647, 'I': 1.6094379124341003, 'you': 2.0149030205422647, 'still': 2.0149030205422647, 'work': 2.0149030205422647})
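The values above come from dividing the vocabulary size (15 distinct words) by the document frequency plus one. The textbook definition of IDF divides the number of documents instead. The sketch below shows that variant under stated assumptions: `docs` is a simplified, hypothetical stand-in for the segmented corpus, and `document_idf` is an illustrative name, not a function from the article or from any library.

```python
import math
from collections import Counter

# Simplified stand-in for the segmented corpus (hypothetical tokens)
docs = [['I', 'being', 'learning', 'computer'],
        ['It', 'being', 'eating'],
        ['I', 'book', 'still', 'there'],
        ['today', 'no', 'work']]

def document_idf(tokenized_docs):
    """IDF with the document count as numerator: log(N / (df + 1))."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # count each word at most once per document
    return {word: math.log(n_docs / (count + 1)) for word, count in df.items()}

idf_docs = document_idf(docs)
print(idf_docs['learning'])  # log(4 / 2), about 0.693
print(idf_docs['being'])     # log(4 / 3), about 0.288
```

With the +1 in the denominator, a word that occurs in every document would get a negative value, log(N / (N + 1)), so some implementations use a different smoothing term; which one is appropriate depends on the application.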

The rest is simple: for each document, multiply every word's TF value by that word's IDF value.
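The final multiplication could look like the sketch below. It is one possible way to finish the computation, not the author's code. It uses a simplified, hypothetical stand-in corpus so that it runs without jieba, and it takes the number of documents as the IDF numerator, whereas the article's code uses the vocabulary size.

```python
import math
from collections import Counter

# Simplified stand-in for the segmented corpus (hypothetical tokens)
docs = [['I', 'being', 'learning', 'computer'],
        ['It', 'being', 'eating'],
        ['I', 'book', 'still', 'there'],
        ['today', 'no', 'work']]

# TF: raw word counts per document
tf = [Counter(doc) for doc in docs]

# Document frequency: number of documents containing each word
df = Counter()
for doc in tf:
    df.update(doc.keys())

# IDF with +1 in the denominator, as in the article's formula
n_docs = len(docs)
idf = {word: math.log(n_docs / (count + 1)) for word, count in df.items()}

# TF-IDF: one {word: score} dictionary per document
tfidf = [{word: count * idf[word] for word, count in doc.items()} for doc in tf]

# Show each document's words ranked from most to least distinctive
for scores in tfidf:
    print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Words shared between documents, such as 'being' and 'I' here, end up with lower scores than words that occur in only one document, which is the behaviour TF-IDF is meant to produce.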
