TF-IDF algorithm: Python code implementation
This is the core code of a TF-IDF implementation I wrote. It is not the complete implementation, but the rest is very simple. Since TF-IDF = TF * IDF, we can calculate the TF and IDF values separately and then multiply them. First we create a simple corpus as an example. It contains only four sentences, and each sentence represents one document:
    copus = ['I am learning computer', 'It is eating', 'my book is still there', 'not working today']
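Because Chinese text has no spaces between words, it must be segmented into words before counting. jieba is one of the more useful word segmentation tools in Python, so I use it here; the jieba link is at the end of the text. (The example sentences are Chinese; they and their tokens are shown here in English translation.) First, segment each document into words: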
    import jieba

    copus = ['I am learning computer', 'It is eating', 'my book is still there', 'not working today']
    # Replace each sentence with its list of segmented words
    copus = [[word for word in jieba.cut(doc)] for doc in copus]
    print(copus)
Output result:
    [['I', 'being', 'learning', 'computer'], ['it', 'being', 'eating'], ['I', '', 'book', 'still', 'in', 'you', 'there'], ['today', 'no', 'work']]
The documents are now in the format we want, so we can start counting word frequencies to get the TF values. Here the Counter class converts each document into a dictionary of words and their frequencies, which already gives us the TF values:
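    from collections import Counter

    tf = []
    for doc in copus:
        # Counter maps each word to the number of times it occurs in the document
        tf.append(Counter(doc))
    print(tf)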
Output result:
    [Counter({'I': 1, 'being': 1, 'learning': 1, 'computer': 1}), Counter({'it': 1, 'being': 1, 'eating': 1}), Counter({'': 1, 'book': 1, 'you': 1, 'in': 1, 'there': 1, 'I': 1, 'still': 1}), Counter({'today': 1, 'no': 1, 'work': 1})]
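The raw counts above are often normalized by document length so that long documents do not dominate the scores. A minimal sketch of that variant, assuming pre-tokenized documents; the `term_frequency` helper and the sample `docs` list are illustrative names, not part of the original code:

```python
from collections import Counter

def term_frequency(tokens):
    """Return each token's count divided by the number of tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Hypothetical pre-tokenized documents, standing in for the jieba output
docs = [['I', 'being', 'learning', 'computer'], ['it', 'being', 'eating']]
tf_normalized = [term_frequency(doc) for doc in docs]
print(tf_normalized[0])
```

Since every word in this toy corpus occurs once per document, normalizing only rescales each document's values; it matters more when documents differ a lot in length.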
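Next, calculate the IDF values. First count how many documents contain each word, then take the logarithm:
    import math
    from collections import defaultdict

    idf = defaultdict(int)
    # Document frequency: the number of documents that contain each word
    for doc in tf:
        for word in doc:
            idf[word] += 1
    # Convert each document frequency to an IDF value (+1 smooths the denominator).
    # Note: len(idf) is the vocabulary size (15 here); the usual IDF definition
    # uses the number of documents, len(tf), as the numerator instead.
    for word in idf:
        idf[word] = math.log(len(idf) / (idf[word] + 1))
    print(idf)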
Output result:
    defaultdict(<class 'int'>, {'': 2.0149030205422647, 'being': 1.6094379124341003, 'learning': 2.0149030205422647, 'computer': 2.0149030205422647, 'today': 2.0149030205422647, 'book': 2.0149030205422647, 'there': 2.0149030205422647, 'it': 2.0149030205422647, 'no': 2.0149030205422647, 'in': 2.0149030205422647, 'eating': 2.0149030205422647, 'I': 1.6094379124341003, 'you': 2.0149030205422647, 'still': 2.0149030205422647, 'work': 2.0149030205422647})
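For comparison, the more common definition of IDF divides the number of documents, not the number of distinct words, by the document frequency. A minimal sketch under that assumption, keeping the +1 smoothing used in the code above; the `inverse_document_frequency` helper and the sample `docs` are illustrative names:

```python
import math
from collections import Counter

def inverse_document_frequency(tokenized_docs):
    """Map each word to log(N / (df + 1)), where N is the number of documents."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # count each word at most once per document
    return {word: math.log(n_docs / (count + 1)) for word, count in df.items()}

# Hypothetical pre-tokenized documents
docs = [['I', 'being', 'learning', 'computer'],
        ['it', 'being', 'eating'],
        ['today', 'no', 'work']]
idf_by_docs = inverse_document_frequency(docs)
print(idf_by_docs['being'], idf_by_docs['today'])
```

With this smoothing, a word that appears in all but one document gets an IDF of 0, and a word that appears in every document gets a negative value, so some implementations add a constant to the result.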
The rest is simple: for each word in each document, multiply its TF value by its IDF value.
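The multiplication step can be sketched as follows. This is a minimal illustration rather than the author's final code: it assumes `tf` is a list of Counter objects and `idf` maps each word to its IDF value, and it uses small hand-written stand-ins for both (the numbers are placeholders) so that the block runs on its own.

```python
from collections import Counter

# Stand-in values with the same shapes as the tf and idf built earlier
tf = [Counter({'I': 1, 'being': 1, 'learning': 1}), Counter({'it': 1, 'being': 1})]
idf = {'I': 2.0, 'being': 1.0, 'learning': 2.0, 'it': 2.0}

# One dict per document: word -> TF * IDF
tfidf = [{word: count * idf[word] for word, count in doc.items()} for doc in tf]

for scores in tfidf:
    # Show the words of each document, highest weight first
    print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Words with the highest TF-IDF weight in a document are the ones that are frequent there but rare in the rest of the corpus.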
TF-IDF algorithm interpretation and Python code implementation (Part 2)