Natural language processing--TF-IDF (keyword extraction)

Last Update:2018-07-18 Source: Internet

Author: User

TF-IDF algorithm

The TF-IDF (term frequency-inverse document frequency) algorithm is a statistical method for evaluating how important a term is to one document in a collection of documents, or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases as the number of documents in the corpus that contain it increases. The algorithm is widely used in data mining, text processing, and information retrieval, for example to find the keywords of an article.

The main idea of TF-IDF is that if a word or phrase appears frequently in one article (high TF) but rarely in other articles, it distinguishes that article well from the others and is suitable for classification. TF-IDF is simply TF * IDF. TF (term frequency) is the frequency with which a term appears in a document. The idea behind IDF (inverse document frequency) is that the fewer documents contain a term, the better the term discriminates between documents, and so the larger its IDF. To extract the keywords of an article, we can compute the TF-IDF of every noun that appears in it. The higher a noun's TF-IDF, the better it distinguishes the article, so the few nouns with the highest TF-IDF values can be used as the article's keywords.

Calculation Steps

Calculate term frequency (TF)
Term frequency = number of occurrences of a term in an article / total number of words in the article
Calculate inverse document frequency (IDF)
Inverse document frequency = log (total number of documents in the corpus / (number of documents containing the term + 1)), using base 10. The 1 added to the denominator avoids division by zero when no document contains the term.
Calculate term frequency-inverse document frequency (TF-IDF)
TF-IDF = term frequency * inverse document frequency
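The three steps above can be sketched in Python as follows. This is a minimal illustration and not a reference implementation: the function names and the toy corpus are assumptions made for the example, and each document is assumed to be already split into a list of words.

```python
import math
from collections import Counter


def term_frequency(term, document):
    """TF: occurrences of `term` divided by the number of words in `document`."""
    if not document:
        return 0.0
    return Counter(document)[term] / len(document)


def inverse_document_frequency(term, corpus):
    """IDF: log10(total documents / (documents containing `term` + 1))."""
    containing = sum(1 for document in corpus if term in document)
    return math.log10(len(corpus) / (containing + 1))


def tf_idf(term, document, corpus):
    """TF-IDF: the product of the two quantities above."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


# Hypothetical toy corpus; each document is a list of already-tokenized words.
corpus = [
    ["bee", "farming", "in", "china"],
    ["china", "economy", "report"],
    ["china", "travel", "guide"],
    ["stock", "market", "report"],
]

# Score every distinct word of the first document against the whole corpus.
scores = {term: tf_idf(term, corpus[0], corpus) for term in set(corpus[0])}
```

In this toy corpus "china" appears in three of the four documents, so its IDF is log10(4 / 4) = 0 and its TF-IDF is 0, while "bee" appears in only one document and scores higher. Note that with the +1 in the denominator, a term that occurs in every document gets a slightly negative IDF.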

Example

Count the term frequency (TF) of the words in an article titled "Chinese Bee Farming".
The most frequent words are function words such as "is" and "in". These are stop words and are excluded from the count.
Suppose the three words "China", "bee", and "farming" then turn out to have the same number of occurrences. Does that mean they are equally important?
No: "China" is a very common word, whereas "bee" and "farming" are comparatively uncommon.
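A small sketch of this counting step, assuming whitespace-separated English text. The stop-word list and the sample sentence are made up for illustration; a real application would use a proper tokenizer and a fuller stop-word list.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "in", "of", "a", "and"}


def content_word_counts(text, stop_words=STOP_WORDS):
    """Count how often each word occurs, skipping stop words."""
    words = text.lower().split()
    return Counter(word for word in words if word not in stop_words)


# Made-up sample text in which the three content words occur equally often.
sample = "bee farming in china is the farming of the bee in china"
counts = content_word_counts(sample)
```

Here "bee", "farming", and "china" each end up with a count of 2, so raw counts alone cannot say which of them matters more. That is the gap the IDF term is meant to fill.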

"Chinese Bee Farming": assume the article is 1000 words long and "China", "bee", and "farming" each appear 20 times. The term frequency (TF) of each of these three words is then 20 / 1000 = 0.02.
Suppose a Google search finds 25 billion pages containing an extremely common Chinese word, and assume this is the total number of Chinese web pages. Of these, 6.23 billion pages contain "China", 48.4 million contain "bee", and 97.3 million contain "farming".
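The arithmetic for this example can be checked with a short script. It applies the base-10 IDF formula from the calculation steps to the page counts quoted above; the variable names are chosen for the example, and the third term is written as "farming" to match the article title.

```python
import math

TOTAL_PAGES = 25_000_000_000  # assumed total number of Chinese pages
TF = 20 / 1000  # each term appears 20 times in a 1000-word article

# Number of pages containing each term, as quoted in the example.
pages_containing = {
    "China": 6_230_000_000,
    "bee": 48_400_000,
    "farming": 97_300_000,
}

results = {}
for term, pages in pages_containing.items():
    idf = math.log10(TOTAL_PAGES / (pages + 1))
    results[term] = (idf, TF * idf)

for term, (idf, score) in results.items():
    print(f"{term:8s} IDF={idf:.3f}  TF-IDF={score:.4f}")
```

Under these assumptions the IDF values come out at roughly 0.603 for "China", 2.713 for "bee", and 2.410 for "farming", giving TF-IDF values of about 0.0121, 0.0543, and 0.0482 respectively.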

It follows that "bee" and "farming" have higher TF-IDF values than "China": they are more representative of the document and make better keywords.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
