How does a search engine calculate weights?

Source: Internet
Author: User
Tags: idf

Let's take the query "atomic energy application" as an example and look at how a search engine assigns weights to words and how those weights are calculated:
The phrase "atomic energy application" can be split into three keywords: "atomic energy", "of", and "application" (in the original Chinese query, "of" is a separate word). Intuitively, a webpage in which these keywords occur more often is more relevant than one in which they occur less often. This approach has an obvious flaw: long webpages have an advantage over short ones, because a long page generally contains more keyword occurrences. We therefore normalize the keyword count by the page length, that is, divide the number of occurrences of a keyword by the total number of words on the page. We call this quotient the keyword's frequency, or term frequency (TF). For example, if "atomic energy", "of", and "application" appear 2, 35, and 5 times on a webpage with 1,000 words, their term frequencies are 0.002, 0.035, and 0.005. Their sum, 0.042, is a simple measure of the relevance between this webpage and the query "atomic energy application".
In general, if a query contains the keywords W1, W2, ..., WN, and their term frequencies on a given webpage are TF1, TF2, ..., TFN, then the relevance of the query to that webpage is:
TF1 + TF2 + ... + TFN.
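
The term-frequency sum above can be sketched in a few lines of Python. This is only an illustration using the counts from the example (2, 35, and 5 occurrences on a 1,000-word page); the function name `term_frequencies` and the dictionary layout are assumptions made for this sketch, not part of any particular search engine.

```python
def term_frequencies(counts, total_words):
    """Divide each keyword's raw count by the page length (hypothetical helper)."""
    return {word: count / total_words for word, count in counts.items()}


# Keyword counts from the example page of 1,000 words.
counts = {"atomic energy": 2, "of": 35, "application": 5}

tf = term_frequencies(counts, total_words=1000)

# Simple relevance: TF1 + TF2 + ... + TFN
relevance = sum(tf.values())
print(tf)
print(relevance)
```

Up to floating-point rounding, the three frequencies are 0.002, 0.035, and 0.005, and the sum is 0.042.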

Readers may have noticed another flaw. In the example above, the word "of" accounts for more than 80% of the total term frequency, yet it is almost useless for determining the topic of the webpage. Such words are called "stop words", and their frequencies should not be counted when measuring relevance. In Chinese, there are dozens of them, including the equivalents of "is", "and", and "in" as well as grammatical particles such as "de". Once these words are ignored, the relevance of the webpage above becomes 0.007, of which "atomic energy" contributes 0.002 and "application" contributes 0.005.
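
As a rough sketch of this step, the snippet below drops stop words before summing the term frequencies. The one-word stop list and the function name `relevance_without_stop_words` are assumptions for illustration only; a real stop list would contain dozens of entries.

```python
# Hypothetical minimal stop list; a real one would be much longer.
STOP_WORDS = {"of"}


def relevance_without_stop_words(tf, stop_words=STOP_WORDS):
    """Sum term frequencies, skipping any keyword in the stop list."""
    return sum(freq for word, freq in tf.items() if word not in stop_words)


# Term frequencies from the example page.
tf = {"atomic energy": 0.002, "of": 0.035, "application": 0.005}

# Share of the unfiltered score that comes from the stop word alone.
stop_word_share = tf["of"] / sum(tf.values())

score = relevance_without_stop_words(tf)
print(stop_word_share)
print(score)
```

Here the stop word accounts for a bit over 80% of the unfiltered sum, and the filtered score is about 0.007.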
Careful readers may spot one more small flaw. In Chinese, "application" is a very common word, while "atomic energy" is a highly specialized one, so the latter should matter more than the former in relevance ranking. We therefore need to assign a weight to each word, and the weights must satisfy the following two conditions:

1. The better a word predicts the topic, the larger its weight should be, and vice versa. When we see "atomic energy" on a webpage, we learn something about its topic; when we see "application", we learn almost nothing. Therefore, "atomic energy" should have a larger weight than "application".
2. The weight of stop words should be zero.
It is easy to see that if a keyword appears on only a few webpages, it narrows down the search target quickly, so its weight should be large. Conversely, if a word appears on a large number of webpages, seeing it still tells us little about what to look for, so its weight should be small. In short, if a keyword W appears on DW webpages, then the larger DW is, the smaller the weight of W should be, and vice versa. In information retrieval, the most commonly used weight is the inverse document frequency (IDF), given by log(D/DW), where D is the total number of webpages (the values below use the natural logarithm). For example, suppose the number of Chinese webpages is D = 1 billion and the stop word "of" appears on all of them, that is, DW = 1 billion. Then its IDF = log(1 billion / 1 billion) = log(1) = 0. If the specialized term "atomic energy" appears on 2 million webpages, that is, DW = 2 million, its weight is IDF = log(500) = 6.2. If the common word "application" appears on 500 million webpages, its weight is IDF = log(2), only 0.7.

In other words, finding one occurrence of "atomic energy" on a webpage is worth about nine occurrences of "application". With IDF, the relevance formula changes from a simple sum of term frequencies to a weighted sum: TF1 * IDF1 + TF2 * IDF2 + ... + TFN * IDFN. In the example above, the relevance between the webpage and the query "atomic energy application" is about 0.0159, of which "atomic energy" contributes 0.0124 (0.002 * 6.2) and "application" contributes only 0.0035 (0.005 * 0.7). This ratio agrees with our intuition.
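
The weighted sum can be checked with a short Python sketch. It assumes the natural logarithm (which matches the quoted values of 6.2 and 0.7) and reuses the example's page counts; the helper name `idf` and the dictionaries are hypothetical and chosen only for this illustration.

```python
import math


def idf(total_pages, pages_with_word):
    """Inverse document frequency log(D / DW), using the natural logarithm."""
    return math.log(total_pages / pages_with_word)


D = 1_000_000_000  # assumed total number of webpages, as in the example

# Number of webpages containing each keyword (DW), from the example.
doc_freq = {
    "atomic energy": 2_000_000,
    "of": 1_000_000_000,
    "application": 500_000_000,
}

# Term frequencies on the example page.
tf = {"atomic energy": 0.002, "of": 0.035, "application": 0.005}

weights = {word: idf(D, dw) for word, dw in doc_freq.items()}

# Weighted relevance: TF1 * IDF1 + TF2 * IDF2 + ... + TFN * IDFN
contributions = {word: tf[word] * weights[word] for word in tf}
score = sum(contributions.values())

print(weights)
print(contributions)
print(score)
```

Because "of" appears on every page, its weight is exactly zero, so no separate stop-word filter is needed in this sketch. The two remaining weights round to 6.2 and 0.7, and the total score comes out at roughly 0.016.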
