The TF-IDF Algorithm in "The Beauty of Mathematics"




By white Shinhuata (http://blog.csdn.net/whiterbear). Please credit the source when reprinting. Thank you.


In "The Beauty of Mathematics", Dr. Wu Jun explains how the TF-IDF algorithm can be used to measure the relevance between web pages and a query. These are my own study notes.


Related terms:

TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique in information retrieval and text mining.

TF: term frequency

IDF: inverse document frequency


Question:

When a user searches on Google, how does Google make sure the returned results are what the user wants? That is, when the user enters "xxx", how does Google make sure that the returned pages are related to "xxx", and that the pages most relevant to "xxx" appear in the first few results?


Analysis:

Suppose the user enters "application of atomic energy", which splits into three words: "atomic energy", "of", and "application". It is tempting to think that a page containing these three words more often is more relevant than a page containing them less often. That is not fair, though: a very long page may contain more keywords simply because it is long, without being any more relevant than a short page on the same topic. Comparing raw keyword counts is therefore unreliable, so we look at how frequently the keywords appear instead.


TF:

How to compute the frequency: divide the number of times a keyword occurs in the page by the total word count of the page. We call this quotient the TF (the term frequency, or the frequency of the keyword within a single text). For example, suppose a web page has 1000 words, in which "atomic energy", "of", and "application" appear 2 times, 35 times, and 5 times respectively. Their word frequencies are then 2/1000 = 0.002, 35/1000 = 0.035, and 5/1000 = 0.005. The sum of the three, 0.042, is the TF of this web page for the query "application of atomic energy".
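The calculation above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `term_frequency` and the label "of" for the middle word of the query are my own choices, not something taken from the book.

```python
def term_frequency(count: int, total_words: int) -> float:
    """Occurrences of a keyword divided by the page's total word count."""
    return count / total_words


# Counts from the example: a page of 1000 words.
# "of" is an assumed label for the middle word of the query.
counts = {"atomic energy": 2, "of": 35, "application": 5}
total_words = 1000

tf = {word: term_frequency(n, total_words) for word, n in counts.items()}
total_tf = sum(tf.values())

for word, value in tf.items():
    print(f"{word}: {value:.3f}")   # 0.002, 0.035, 0.005
print(f"sum: {total_tf:.3f}")       # 0.042
```

Dividing by the page length is what makes a long page and a short page comparable.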

With the word frequency, we have a simple and convenient way to measure the relevance between a query and a particular web page. In general, if the query contains N keywords whose word frequencies in a particular web page are TF1, TF2, ..., TFN, then the relevance of the query to that page is: TF1 + TF2 + ... + TFN.

However, there is still a loophole. In the example above, the word "of" accounts for more than 80% of the total word frequency, yet it is almost useless for determining the topic of the page. We should ignore such words and cancel out their effect. (Note: such a word is called a "stop word" (stopword). English examples include "a", "the", "or", and other articles, prepositions, and conjunctions; Chinese examples include "和" (and), "中" (in), "得" (a particle), and so on.) Ignoring the stop words, the relevance of the page above becomes 0.007, of which "atomic energy" contributes 0.002 and "application" contributes 0.005.
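A small sketch of the simple relevance score with stop words ignored. The stop-word set below is a tiny illustrative list of my own, not a real stop-word dictionary, and `simple_relevance` is a hypothetical helper name.

```python
# Illustrative stop-word list only; a real system would use a much larger one.
STOP_WORDS = {"a", "the", "or", "of"}


def simple_relevance(tf_by_word: dict, stop_words: set = STOP_WORDS) -> float:
    """Sum the term frequencies of all query words that are not stop words."""
    return sum(tf for word, tf in tf_by_word.items() if word not in stop_words)


# Term frequencies from the 1000-word example page.
tf_by_word = {"atomic energy": 0.002, "of": 0.035, "application": 0.005}

with_stop_words = sum(tf_by_word.values())         # about 0.042
without_stop_words = simple_relevance(tf_by_word)  # about 0.007
stop_word_share = tf_by_word["of"] / with_stop_words  # about 0.83

print(f"{with_stop_words:.3f} {without_stop_words:.3f} {stop_word_share:.2f}")
```

The share of roughly 83% is the "more than 80%" that the stop word contributes before it is filtered out.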

The attentive reader may notice another loophole here. In Chinese, "application" is a very common word, while "atomic energy" is a highly specialized term, so the latter should count for more when ranking pages by relevance. The word frequency computed earlier implicitly assumes that "atomic energy" and "application" carry the same weight, so we need to improve it.

We assign a weight to each word in Chinese, and this weight must satisfy two conditions:

1> The stronger a word's ability to predict the topic (its relevance to the subject), the greater its weight, and vice versa. Seeing the term "atomic energy" on a web page tells us a good deal about the page's topic, whereas seeing the word "application" tells us almost nothing. The weight of "atomic energy" should therefore be greater than that of "application".

2> The weight of a stop word should be zero.

It is easy to see that if a keyword appears in only a few pages, it lets us lock onto the search target quickly, so its weight should be relatively large. Conversely, if a word appears in a large number of pages, seeing it still leaves us unsure what content is wanted, so its weight should be small. (For example, when you search for "Python gensim", the keyword "python" appears in a great many pages, whose content may be an introduction to Python, the Python website, Python applications, and so on. "gensim" appears in relatively few pages, typically gensim's official website, gensim installation tutorials, gensim study notes, and the like, and these are the pages we would rather see.)
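The quantity this paragraph reasons about is how many pages contain a word. A minimal sketch of that count follows, using a made-up toy corpus of four "pages" (the pages and the helper name `document_frequency` are hypothetical, chosen only to mirror the Python/gensim example).

```python
def document_frequency(term: str, pages: list) -> int:
    """Number of pages whose word list contains the term at least once."""
    return sum(1 for words in pages if term in words)


# Hypothetical toy corpus: each page is a list of lowercase words.
pages = [
    ["python", "introduction"],
    ["python", "website"],
    ["python", "application"],
    ["python", "gensim", "tutorial"],
]

df_python = document_frequency("python", pages)  # 4: appears on every page
df_gensim = document_frequency("gensim", pages)  # 1: appears on one page only

print(df_python, df_gensim)
```

Here "python" cannot distinguish any page from the others, while "gensim" pins down a single page, which is why the rarer word deserves the larger weight.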


IDF:

We have seen that some words should carry more weight in a search, but how do we determine the weight of each word? In information retrieval, the most widely used weight is the IDF (inverse document frequency). Formula: IDF = log(D / Dw), where D is the total number of pages and Dw is the number of pages in which the keyword w appears.

Meaning: suppose a keyword w appears in Dw pages out of D pages in total. The larger Dw is, the smaller the weight of w, and vice versa. (For example, if the keyword "python" appears in 100,000 pages and "gensim" appears in only 1,000 pages, then "gensim" gets a larger weight than "python", so the search results come closer to what you want.) For example, assume the number of Chinese web pages is D = 1 billion and a stop word appears in all of them, i.e. Dw = 1 billion. Then its IDF = log(1 billion / 1 billion) = log(1) = 0. If the specialized term "atomic energy" appears in 2 million pages, i.e. Dw = 2 million, then its weight is IDF = log(500) = 2.7 (logarithms here are base 10). If the common word "application" appears in 500 million pages, its weight IDF = log(2) is only 0.3.
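The IDF values in this example can be checked with a short script. One assumption is worth stating loudly: the text does not name the base of the logarithm, and base 10 is inferred here only because it reproduces log(500) = 2.7 and log(2) = 0.3. The function name `idf` is my own.

```python
import math


def idf(total_pages: int, pages_with_term: int) -> float:
    """Inverse document frequency: log(D / Dw).

    Base 10 is an assumption inferred from the numbers in the example.
    """
    return math.log10(total_pages / pages_with_term)


D = 1_000_000_000  # assumed total number of Chinese web pages

idf_stop_word = idf(D, D)              # stop word on every page: 0.0
idf_atomic = idf(D, 2_000_000)         # log10(500), about 2.7
idf_application = idf(D, 500_000_000)  # log10(2), about 0.3

print(f"{idf_stop_word:.1f} {idf_atomic:.1f} {idf_application:.1f}")
```

A word that appears on every page gets weight zero automatically, which satisfies the stop-word condition without needing a separate stop-word list.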

In other words, finding one match of "atomic energy" in a web page is equivalent to finding nine matches of "application". With IDF, the relevance formula changes from a simple sum of word frequencies to a weighted sum: TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN. In the example above, the relevance between the web page and the query "application of atomic energy" is 0.0069, of which "atomic energy" contributes 0.0054 and "application" contributes only 0.0015. This ratio matches our intuition well.
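The weighted sum can be sketched as follows, using the rounded TF and IDF values from the example. The helper name `tfidf_relevance` and the "of" label for the stop word are assumptions of this sketch.

```python
def tfidf_relevance(tf_by_word: dict, idf_by_word: dict) -> float:
    """Weighted sum TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN over the query words."""
    return sum(tf * idf_by_word[word] for word, tf in tf_by_word.items())


# Rounded values from the worked example; the stop word has IDF 0.
tf_by_word = {"atomic energy": 0.002, "of": 0.035, "application": 0.005}
idf_by_word = {"atomic energy": 2.7, "of": 0.0, "application": 0.3}

contributions = {w: tf_by_word[w] * idf_by_word[w] for w in tf_by_word}
score = tfidf_relevance(tf_by_word, idf_by_word)

for word, value in contributions.items():
    print(f"{word}: {value:.4f}")  # 0.0054, 0.0000, 0.0015
print(f"relevance: {score:.4f}")   # 0.0069
```

Note that the stop word drops out of the score on its own because its IDF is zero, even though its TF is by far the largest.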

The concept of TF-IDF is regarded as the most important invention in information retrieval. It is widely used in search, document classification, and other related fields.

