This is an introductory article about the TF-IDF framework, a foundation of search engine ranking. It is not one of the generalities, or even out-of-context fragments, that occasionally turn up on the net; it combines search engine theory with practical knowledge summed up from the many examples I have observed myself. It may be relatively difficult to understand, but believe me, the time spent understanding it is definitely worth it.
I am writing this article mainly so that the basic theory behind some of the content in the upcoming "SEO practice" series is written down first and does not have to take up space in those articles.
This article first quotes the overview of the TF-IDF framework from Zhang Junlin's "This is the search engine". Because the original text is long, what follows is a summary of what I consider the key points. The summary may have shortcomings, so for more detail I recommend reading the original book.
(Note: "TF-IDF" and "TF*IDF" are just different writing habits; whichever form the book uses, there is no difference between them.)
TF-IDF principle overview
When a user searches for a word, the search engine matches the word against the documents in its index, takes out a certain number of the documents most relevant to the word, and passes them on to the subsequent ranking calculations. "Most relevant" here is measured by "weight", and for most search engines the TF-IDF framework is a relatively important part of the weight calculation. It considers two main factors: term frequency (TF) and inverse document frequency (IDF).
Term frequency factor (TF)
The TF factor represents term frequency, that is, the number of times a word appears in a document. Generally speaking, the more often a word appears in a document, the more relevant the document is to that word, so the word should be given a higher weight.
Different formulas can be used to calculate the frequency factor, depending on the starting point. The simplest is to use the raw count directly: if a word appears 5 times in a document, its TF value is 5.
A variant formula for the frequency factor is: W = 1 + log(TF)
Here the log of the term frequency TF is used as the frequency weight. For example, if a word appears 4 times in a document, its frequency factor weight is 3 (taking the log to base 2). The 1 in the formula is there for smoothing: if TF is 1, log(TF) is 0, so a word that does appear in the document would be scored as if it never appeared. Adding 1 avoids this. The reason for taking the log is as follows: even if a word appears 10 times, its feature weight should not be 10 times that of a word appearing once, so the log is used to curb such excessive differences.
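The smoothed formula can be tried out with a few lines of Python. This is only a minimal sketch: the function name `log_tf_weight` is made up for illustration, and base 2 is an assumption inferred from the "4 times gives 3" example above, since the text does not name a base.

```python
import math


def log_tf_weight(tf):
    """Return W = 1 + log2(TF) for a word that appears tf times.

    Base 2 is an assumption taken from the article's example
    (4 occurrences -> weight 3). A word that never appears gets 0.
    """
    if tf <= 0:
        return 0.0
    return 1.0 + math.log2(tf)


# Compare the raw count with the log-smoothed weight.
for count in (1, 4, 10):
    print(count, round(log_tf_weight(count), 3))
```

Note how a word appearing 10 times ends up with a weight well below 10 times that of a word appearing once, which is exactly the damping effect described above.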
There is also a more important variant formula that takes the length of the document into account, because the TF values of words in a long document are generally higher than in a short one. It is not covered in detail here.
Inverse document frequency factor (IDF)
IDF is a global factor defined over a document collection: it depends only on the given collection, not on any specific document. So IDF does not reflect the characteristics of a document itself, but the relative importance of a word.
The calculation formula is as follows: IDF = log(N / n)
where N represents the total number of documents in the document collection, and n represents the number of documents in which the feature word appears, that is, its document frequency. By the formula, the more documents contain a word, the smaller its IDF value, which means the word is less able to distinguish between different documents.
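As a rough illustration, the sketch below computes IDF in the form log(N / n), where N is the total number of documents and n is the document frequency of the word. The function name `idf`, the natural log, and the collection sizes are all assumptions chosen for the example.

```python
import math


def idf(total_docs, doc_freq):
    """Return IDF = log(N / n), with N = total_docs and n = doc_freq."""
    if doc_freq <= 0 or doc_freq > total_docs:
        raise ValueError("doc_freq must be between 1 and total_docs")
    return math.log(total_docs / doc_freq)


# Hypothetical collection of one million documents.
N = 1_000_000
rare_word_idf = idf(N, 100)        # word found in only 100 documents
common_word_idf = idf(N, 500_000)  # word found in half of all documents

print(round(rare_word_idf, 3), round(common_word_idf, 3))
```

The rare word gets the larger IDF, and a word that appears in every document gets an IDF of 0, meaning it cannot distinguish documents at all.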
TF-IDF framework
The formula for the TF-IDF value is:
Weight = TF * IDF
The larger the value, the more relevant the document is to the word.
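Putting the two factors together, a toy version of the whole calculation might look like the sketch below. It assumes the log-smoothed TF variant and IDF = log(N / n), both with base 2; the function name `tf_idf_weights` and the three tiny "documents" are invented for illustration and have nothing to do with how any real search engine stores its index.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {word: weight} dict per document.

    docs is a list of token lists. The weight is
    (1 + log2(TF)) * log2(N / n), an assumed combination of the
    variants described in the text.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            word: (1.0 + math.log2(tf)) * math.log2(n_docs / doc_freq[word])
            for word, tf in counts.items()
        })
    return weights


# Hypothetical mini collection: "price" appears in every document.
docs = [
    ["inkjet", "price", "inkjet", "printer"],
    ["price", "list", "price"],
    ["price", "of", "fruit"],
]
weights = tf_idf_weights(docs)
print(weights[0])
```

In this toy collection "price" occurs in all three documents, so its IDF and therefore its weight is 0, while "inkjet" keeps a clearly positive weight in the first document.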
How Baidu actually uses it
Baidu naturally applies the TF-IDF framework. But for the ranking of a single index term, TF-IDF is not the deciding factor. Baidu's ranking is essentially a probabilistic retrieval model.
According to my earlier simple statistical analysis of Baidu, its TF calculation uses at least the log smoothing described above. In addition, as mentioned before, once a keyword appears more often than a certain threshold, further occurrences lower its TF value, again in log form, which pushes the ranking down.
Because this mechanism exists, the TF value of each word on a page has its own upper limit, which is an important concept for SEO.
The simplest way to experience a TF-IDF calculation
It is not very precise, but you can do the following: take the number of times a keyword appears in an article as the TF value, then search Google for that word and take the total number of search results as the DF value. Divide TF by DF and you get the simplest TF-IDF value.
Although such a calculation has little practical significance, TF-IDF is much easier to understand once you have actually calculated it.
SEO implications
Take the term "inkjet price" as an example: Baidu will split it into the two words "inkjet" and "price". (As an aside, whether a term gets segmented should also be judged from data rather than intuition; if I get the chance I will write up some of the methods I have been using recently. Some people judge word segmentation from the highlighted keywords in Baidu snapshots, which has no factual basis and no value.)
Searching Google for "inkjet" and "price" separately gives about 20,600,000 results for "inkjet" and about 1,850,000,000 results for "price", so the DF value of the latter is nearly a hundred times that of the former. (Baidu is not used here because it shows at most 100 million search results.)
In this case, even if "inkjet" and "price" appear the same number of times in a document, the latter ends up with a far lower weight than the former because of the IDF factor.
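The comparison can be reproduced with the simple "TF divided by DF" method described earlier, using the two Google result counts quoted above as DF values. This is only a sketch: the function name `simple_tf_idf` and the shared TF of 5 are assumptions made for illustration.

```python
def simple_tf_idf(tf, df):
    """Crude TF-IDF stand-in: term frequency divided by document frequency."""
    return tf / df


# DF values: Google result counts quoted in the article.
DF_INKJET = 20_600_000
DF_PRICE = 1_850_000_000

# Hypothetical: both words appear 5 times in the same document.
tf = 5

w_inkjet = simple_tf_idf(tf, DF_INKJET)
w_price = simple_tf_idf(tf, DF_PRICE)

# How many times heavier "inkjet" is than "price" at equal TF.
ratio = w_inkjet / w_price
print(round(ratio, 1))
```

With equal TF, the ratio of the two weights is simply the inverse ratio of the DF values, roughly 90 to 1 in favor of "inkjet".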
Therefore, in general, only a page where "inkjet" has a high weight has a chance of ranking well for "inkjet price"; the weight of "price" matters very little. In any case, the weight of "price" is unlikely to gain much through the TF-IDF rule.
So at least for Baidu, if you want to rank for "inkjet price", you should generally use a landing page that already ranks well for "inkjet"; otherwise it will be much harder.
Finally
Limited by my own level of SEO, I cannot conclude whether SEO practitioners should study search engines in great depth; from my subjective point of view at least, going too deep into search engine principles is of little significance for SEO. However, I do think one should at least master the basics of search engines. If you have not spent any effort on understanding even the most classical algorithms, how can you talk about dealing with search engines?
Original: http://semwatch.org/2012/03/tf-idf/