The TF-IDF algorithm is well known to many professional SEO practitioners. It is a weighting technique commonly used in information retrieval and text mining. Applied to web page analysis, it weights the keywords relevant to a page and computes, across many pages, how much weight a particular keyword carries on each one, which gives the final ranking algorithm a quantitative basis.
First, look at the TF*IDF formula: TF*IDF = TF × IDF = (1 + log TF(t,d)) × IDF(t) = (1 + log TF(t,d)) × log(N/DF(t)). Why analyze this formula? Because the higher a page's TF-IDF value for an indexed term, the more relevant the page's text is to that term and the more weight the page can earn from the search engine, which goes a long way toward supporting its later ranking.
In TF*IDF, TF (term frequency) indicates how often a term appears in a document. IDF (inverse document frequency) reflects how rare the term is: the fewer documents contain term t, the larger its IDF, which shows that t is good at distinguishing documents from one another. As a formula, IDF(t) = log(N/DF(t)), where DF(t) is the number of documents that contain the search term t and N is the total number of pages on the Internet.
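To make the formula concrete, here is a minimal Python sketch of it. The function names are my own, the logarithm is assumed to be base 10 (which matches the worked numbers later in this article), and returning 0 for a term that does not appear at all is a common convention rather than something the formula itself states.

```python
import math

def tf_weight(raw_count):
    """Log-scaled term frequency: 1 + log10(count), or 0 if the term is absent (assumed convention)."""
    return 1 + math.log10(raw_count) if raw_count > 0 else 0.0

def idf(total_docs, docs_with_term):
    """Inverse document frequency: log10(N / DF(t))."""
    return math.log10(total_docs / docs_with_term)

def tf_idf(raw_count, total_docs, docs_with_term):
    """TF*IDF = (1 + log TF(t,d)) * log(N / DF(t))."""
    return tf_weight(raw_count) * idf(total_docs, docs_with_term)

# Illustrative numbers only: a term used 10 times on a page,
# found on 1,000 of 1,000,000 pages.
print(round(tf_idf(10, 1_000_000, 1_000), 2))  # (1 + 1) * 3 = 6.0
```

A rarer term (smaller DF) gets a larger IDF, so the same number of occurrences on a page counts for more.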
These concepts are hard to grasp fully in the abstract, so here is an example that should make them clear.
Using TF-IDF to explain the rankings for "SEO diagnosis"
Take the page rankings for the keyword "SEO diagnosis". I picked three sites from the top ten results and analyzed how often the related words appear on each page:
Ranked second is A5's SEO diagnosis page: "SEO" and "diagnosis" appear 41 and 46 times, and "SEO diagnosis" appears 20 times;
Ranked third is the site of a Changsha company: "SEO" and "diagnosis" appear 12 and 4 times, and "SEO diagnosis" appears once;
My Fine Smelling Rose blog ranks tenth: "SEO" has the highest frequency at 84, "diagnosis" appears 7 times, and "SEO diagnosis" appears 4 times.
A Baidu search reports about 1,530,000 pages for "SEO diagnosis", while "SEO" and "diagnosis" each hit Baidu's display limit of about 100,000,000 results. Take N = 10000 × 100 million (10^12) pages and count documents in units of 100 million, so that N = 10000, DF = 1 for "SEO" and for "diagnosis", and DF ≈ 0.015 for "SEO diagnosis". The TF*IDF values of the three keywords on the three pages are then calculated as follows:
1. First, calculate the IDF value of the three keywords (all logarithms are base 10):
SEO: IDF = log(N/DF(t)) = log(10000/1) = 4
diagnosis: IDF = log(N/DF(t)) = log(10000/1) = 4
SEO diagnosis: IDF = log(N/DF(t)) = log(10000/0.015) = 7 - log 15 ≈ 5.8, rounded to 6
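The IDF step can be checked with a few lines of Python. This is only a sketch: it assumes N = 10^12 pages, the value that makes log(10000/1) = 4 work out when page counts are expressed in units of 100 million, and it uses the Baidu result counts quoted above as document frequencies.

```python
import math

# Assumed total page count: 10000 x 100 million = 10^12.
N = 10_000 * 10**8

# Document frequencies taken from the Baidu result counts in the article.
doc_freq = {
    "SEO": 100_000_000,
    "diagnosis": 100_000_000,
    "SEO diagnosis": 1_530_000,
}

# IDF(t) = log10(N / DF(t))
idf = {term: math.log10(N / df) for term, df in doc_freq.items()}

for term, value in idf.items():
    print(f"{term}: IDF = {value:.2f}")
# SEO: IDF = 4.00
# diagnosis: IDF = 4.00
# SEO diagnosis: IDF = 5.82
```

The phrase is far rarer than either single word, which is why its IDF is about two points higher.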
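2. Next, calculate the TF value of the three keywords. The calculation here uses log TF(t,d) on its own, dropping the constant 1 from the formula given at the start. Because the same constant would be added for every page, leaving it out does not change how the sites order for a given keyword.
TF values of the three sites for the keyword "SEO":
Changsha: TF = log(TF(t,d)) = log 12 ≈ 1.08
A5: TF = log(TF(t,d)) = log 41 ≈ 1.61
Fine Smelling Rose: TF = log(TF(t,d)) = log 84 ≈ 1.92
TF values of the three sites for the keyword "diagnosis":
Changsha: TF = log(TF(t,d)) = log 4 ≈ 0.60
A5: TF = log(TF(t,d)) = log 46 ≈ 1.66
Fine Smelling Rose: TF = log(TF(t,d)) = log 7 ≈ 0.85
TF values of the three sites for the keyword "SEO diagnosis":
Changsha: TF = log(TF(t,d)) = log 1 = 0
A5: TF = log(TF(t,d)) = log 20 ≈ 1.30
Fine Smelling Rose: TF = log(TF(t,d)) = log 4 ≈ 0.60
3. The TF*IDF value of each keyword for each site is its TF value from step 2 multiplied by the IDF value from step 1 (4, 4, and 6), computed here from the unrounded logarithms:
Changsha: SEO ≈ 4.32, diagnosis ≈ 2.41, SEO diagnosis = 0
A5: SEO ≈ 6.45, diagnosis ≈ 6.65, SEO diagnosis ≈ 7.81
Fine Smelling Rose: SEO ≈ 7.70, diagnosis ≈ 3.38, SEO diagnosis ≈ 3.61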
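The products in step 3 can be recomputed with a short script. This sketch makes a few assumptions: it uses the keyword counts quoted in this article, the rounded IDF values 4, 4, and 6 from step 1, base-10 logarithms, and the plain log TF variant from step 2 rather than 1 + log TF. The dictionary layout and names are my own.

```python
import math

# Raw keyword counts per site, as quoted in the article.
counts = {
    "Changsha": {"SEO": 12, "diagnosis": 4, "SEO diagnosis": 1},
    "A5": {"SEO": 41, "diagnosis": 46, "SEO diagnosis": 20},
    "Fine Smelling Rose": {"SEO": 84, "diagnosis": 7, "SEO diagnosis": 4},
}

# Rounded IDF values from step 1.
idf = {"SEO": 4, "diagnosis": 4, "SEO diagnosis": 6}

def tf(count):
    """Step 2 variant of TF: plain log10 of the raw count, without the '1 +'."""
    return math.log10(count)

# TF*IDF for every (site, keyword) pair.
scores = {
    site: {term: tf(n) * idf[term] for term, n in terms.items()}
    for site, terms in counts.items()
}

for site, row in scores.items():
    cells = ", ".join(f"{term} = {value:.2f}" for term, value in row.items())
    print(f"{site}: {cells}")
```

Swapping `tf` for `1 + math.log10(count)` raises every score but leaves the order of the sites for each keyword unchanged.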
Comparing the TF*IDF values, we can clearly see that my blog has the highest value for "SEO", while A5 Webmaster Network has the highest values for "diagnosis" and "SEO diagnosis".
Judged purely by TF*IDF relevance, A5 Webmaster Network is the most relevant page for "SEO diagnosis" and should rank best, my blog should rank between the other two (the day before yesterday it did), and the Changsha site should come last. The actual results differ somewhat. This shows that page ranking also depends on other important factors, such as the overall weight of the site, the weight and quality of the individual page, external links, and user interaction (that is, user experience), all of which we need to consider.
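The gap between the TF*IDF order and the real order can be shown side by side. This is a sketch under the same assumptions as the calculation above (counts and positions as quoted in this article, IDF rounded to 6, plain base-10 log TF); the variable names are illustrative.

```python
import math

# "SEO diagnosis" counts and observed Baidu positions quoted in the article.
pages = {
    "A5": {"count": 20, "observed_rank": 2},
    "Changsha": {"count": 1, "observed_rank": 3},
    "Fine Smelling Rose": {"count": 4, "observed_rank": 10},
}
IDF_SEO_DIAGNOSIS = 6  # rounded value from step 1

def score(count):
    """TF*IDF for 'SEO diagnosis' using plain log10 TF."""
    return math.log10(count) * IDF_SEO_DIAGNOSIS

# Order the sites by TF*IDF (highest first) and by observed position.
predicted = sorted(pages, key=lambda site: score(pages[site]["count"]), reverse=True)
observed = sorted(pages, key=lambda site: pages[site]["observed_rank"])

print("predicted:", predicted)  # ['A5', 'Fine Smelling Rose', 'Changsha']
print("observed: ", observed)   # ['A5', 'Changsha', 'Fine Smelling Rose']
```

The two orders agree only on the first place, which is the point of the paragraph above: TF*IDF is one ranking signal among several.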
In addition, compare the TF*IDF values within each site. For the Changsha site and my Fine Smelling Rose blog, "SEO" carries the highest value, so improving their ranking depends heavily on how they rank for "SEO", which plays the decisive role. For A5 Webmaster Network, "SEO diagnosis" plays the decisive role, and its ranking for "SEO" has little effect on its ranking fluctuations. There is some evidence for this: the day before yesterday my blog ranked third for "SEO diagnosis" while its ranking for "SEO" was on page 10. "SEO" has since fallen to page 23, and "SEO diagnosis" has dropped to tenth. Studying TF*IDF more can therefore help us explain many keyword ranking phenomena and formulate a targeted SEO strategy.
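The within-site comparison amounts to asking which keyword has the largest TF*IDF value on each site. Here is a minimal sketch, again assuming the counts quoted in this article, IDF values of 4, 4, and 6, and plain base-10 log TF; the names are my own.

```python
import math

# Raw keyword counts per site, as quoted in the article.
counts = {
    "Changsha": {"SEO": 12, "diagnosis": 4, "SEO diagnosis": 1},
    "A5": {"SEO": 41, "diagnosis": 46, "SEO diagnosis": 20},
    "Fine Smelling Rose": {"SEO": 84, "diagnosis": 7, "SEO diagnosis": 4},
}
idf = {"SEO": 4, "diagnosis": 4, "SEO diagnosis": 6}

# For each site, find the keyword with the highest TF*IDF value.
dominant = {}
for site, terms in counts.items():
    scores = {term: math.log10(n) * idf[term] for term, n in terms.items()}
    dominant[site] = max(scores, key=scores.get)

for site, term in dominant.items():
    print(f"{site}: dominant keyword = {term}")
```

Under these assumptions the script reports "SEO" for the Changsha site and my blog and "SEO diagnosis" for A5, in line with the comparison above.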
Of course, this calculation assumes an ideal state, but it can still explain the causes of some SEO phenomena. Once we grasp the basic idea of the TF*IDF algorithm and apply it to site optimization, we can optimize a site more effectively. On my blog, for example, reducing the influence of the word "SEO" on the site's rankings might give better control over the page's ranking for "SEO diagnosis".
This article was published by Xu Zi Rain of the Hangzhou SEO search network (http://www.soxunseo.com). You are welcome to reprint it; when you do, please keep this link. Thank you for your cooperation!