1. TF-IDF (Term Frequency-Inverse Document Frequency)
2. My understanding:
Formula: $\mathrm{TF} = \frac{\text{number of times the keyword appears in the document}}{\text{total number of words in the document}}$ ## weight W (term frequency)
Or
$\mathrm{TF} = \frac{\text{number of times the word appears in the document}}{\text{number of times the most frequent word appears in the document}}$
$\mathrm{IDF} = \log \frac{\text{total number of documents}}{\text{number of documents containing the keyword} + 1}$ ## "total number of documents" means the whole corpus, i.e. multiple files
$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$ # term frequency * inverse document frequency
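To check the formulas by hand, here is a minimal sketch in plain Python. The function names and the sample numbers are made up for illustration, and the natural logarithm is an assumption, since the notes do not say which base is used.

```python
import math


def tf_by_length(count, total_words):
    """TF variant 1: occurrences of the word / total words in the document."""
    return count / total_words


def tf_by_max(count, max_count):
    """TF variant 2: occurrences of the word / occurrences of the most frequent word."""
    return count / max_count


def idf(total_docs, docs_with_word):
    """IDF with +1 in the denominator, as in the formula above (natural log assumed)."""
    return math.log(total_docs / (docs_with_word + 1))


def tf_idf(tf_value, idf_value):
    """TF-IDF is the product of the two weights."""
    return tf_value * idf_value


# Made-up numbers: the word appears 3 times in a 100-word document
# and occurs in 9 of 1000 documents.
tf_value = tf_by_length(3, 100)
idf_value = idf(1000, 9)
score = tf_idf(tf_value, idf_value)
print(tf_value, idf_value, score)
```

With the +1 in the denominator, a word that occurs in every document gets a slightly negative IDF, because the fraction drops below 1.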
3. Steps for algorithm implementation:
1) Word segmentation
2) Count the number of documents
3) Python implementation: jieba
4) HanLP implementation
5) NLTK implementation
6) scikit-learn implementation
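The first two steps above (word segmentation, then counting documents) can be sketched end to end in plain Python before turning to any library. This is only an illustration under stated assumptions: `tokenize`, `tf_idf_scores`, and the tiny corpus are hypothetical, and whitespace splitting stands in for a real segmenter. For Chinese text, a segmenter such as jieba's `lcut` could be passed as the `tokenizer` argument instead. The library implementations listed above may use different smoothing or normalization, so their numbers will not necessarily match this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    # Placeholder segmentation (assumption): lowercase and split on whitespace.
    return text.lower().split()


def tf_idf_scores(documents, tokenizer=tokenize):
    """Return one {word: tf-idf} dict per document."""
    # Step 1: word segmentation.
    tokenized = [tokenizer(doc) for doc in documents]

    # Step 2: total number of documents, and for each word the number
    # of documents that contain it.
    total_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Step 3: TF (count / document length) times IDF (log with +1 smoothing).
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total_words = len(tokens)
        scores.append({
            word: (count / total_words)
            * math.log(total_docs / (doc_freq[word] + 1))
            for word, count in counts.items()
        })
    return scores


# Made-up corpus for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing at dawn",
    "the sun rises at dawn",
]
scores = tf_idf_scores(corpus)
for word, value in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(word, round(value, 4))
```

In this corpus "the" appears in three of the four documents, so its IDF is log(4 / 4) = 0 and it drops out of the ranking, while words unique to one document score highest.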
4. Application scenarios:
Principle: 53728499
Principle and Application of TF-IDF