TF-IDF algorithm
TF-IDF (term frequency-inverse document frequency) is a statistical method for evaluating how important a term is to one document in a collection of documents, or corpus. The importance of a term increases in proportion to the number of times it appears in the document, but decreases as the number of documents in the corpus that contain it grows. The algorithm is widely used in data mining, text processing, and information retrieval, for example to extract the keywords of an article.
The main idea of TF-IDF is that if a word or phrase appears frequently in one article (high TF) and rarely in other articles, it distinguishes that article well and is suitable for classification. TF-IDF is literally TF * IDF. TF (term frequency) is the frequency with which a term appears in a document. The main idea of IDF (inverse document frequency) is that the fewer documents contain a term, the better the term discriminates between documents, and so the larger its IDF. To extract the keywords of an article, we can compute the TF-IDF of every noun that appears in it; the higher the TF-IDF, the better the noun distinguishes this article. The few nouns with the highest TF-IDF values can then be used as the article's keywords.
Calculation Steps
Calculate the term frequency (TF)
Term frequency = number of occurrences of a term in an article / total number of terms in the article
Calculate the inverse document frequency (IDF)
Inverse document frequency = log (total number of documents in the corpus / (number of documents containing the term + 1)) (base-10 logarithm; the + 1 avoids division by zero when no document contains the term)
Calculate the term frequency-inverse document frequency (TF-IDF)
TF-IDF = term frequency * inverse document frequency
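The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: it assumes each document has already been tokenized into a list of lowercase words with stop words removed, and the function names and the tiny corpus are hypothetical, chosen just for this example.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: occurrences of each term / total number of terms in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(term, corpus):
    """IDF: log10(total documents / (documents containing the term + 1))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / (containing + 1))


def tf_idf(tokens, corpus):
    """TF-IDF of every term in one document, relative to the corpus."""
    tf = term_frequency(tokens)
    return {
        term: freq * inverse_document_frequency(term, corpus)
        for term, freq in tf.items()
    }


def top_keywords(tokens, corpus, k=3):
    """The k terms with the highest TF-IDF, used as keywords."""
    scores = tf_idf(tokens, corpus)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["china", "bee", "farming", "bee"],
    ["china", "economy"],
    ["china", "history"],
    ["china", "travel"],
]

keywords = top_keywords(corpus[0], corpus, k=2)
print(keywords)
```

Note that with the + 1 in the denominator, a term that occurs in every document of a small corpus (here "china") gets a slightly negative IDF, so it ranks below the rarer terms.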
Example
Count the term frequency (TF) of each word in the article "Chinese Bee Farming".
The most frequent words turn out to be ones such as "is" and "in". These are the most commonly used words (stop words) and are excluded from the statistics.
Suppose we then find that the three words "China", "bee", and "farming" occur the same number of times. Does that mean they are equally important?
"China" is a very common word, whereas "bee" and "farming" are comparatively uncommon.
Assume the article is 1000 words long and "China", "bee", and "farming" each appear 20 times. The term frequency (TF) of each of the three words is then 0.02.
Suppose a Google search reports 25 billion pages containing an extremely common word, and assume this is the total number of Chinese web pages. Of these, 6.23 billion pages contain "China", 48.4 million contain "bee", and 97.3 million contain "farming".
Applying the IDF and TF-IDF formulas gives IDF values of about 0.603, 2.713, and 2.410, and TF-IDF values of about 0.0121, 0.0543, and 0.0482, respectively. "Bee" and "farming" are therefore more 'critical' to the document than "China" and represent it better.
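The arithmetic behind this example can be checked with a short script. It is a sketch that simply plugs the page counts quoted above into the IDF and TF-IDF formulas from the calculation steps; the variable names are illustrative only.

```python
import math

# Figures taken from the example: assumed total number of Chinese pages
# and the number of pages containing each word.
total_pages = 25_000_000_000
pages_containing = {
    "China": 6_230_000_000,
    "bee": 48_400_000,
    "farming": 97_300_000,
}

# Each word appears 20 times in a 1000-word article.
tf = 20 / 1000

scores = {}
for term, n in pages_containing.items():
    idf = math.log10(total_pages / (n + 1))  # base-10 logarithm
    scores[term] = tf * idf
    print(f"{term}: IDF = {idf:.3f}, TF-IDF = {scores[term]:.4f}")
```

Although all three words share the same TF, "bee" ends up with the highest score and "China" with the lowest, because the IDF term penalizes words that appear on many pages.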