Note: The knowledge in this article comes from Rascal Rui's book "SEO Deep Analysis"; thanks to the author for writing such good SEO knowledge for us.
"Guide" in the Internet so developed today, the same information will be published in a number of websites, the same news will be most of the media site coverage, coupled with small webmaster and SEO personnel tireless network collection, resulting in a large number of duplicate information on the network. However, when a user searches for a keyword, the search engine must not want to present the results to the user as the same content. Crawl these duplicate pages, to a certain extent, is the search engine's own resources waste, so the removal of duplicate content of the site has become a search engine is facing a major problem.
In a typical search engine architecture, web page deduplication generally takes place in the spider crawling stage; the earlier the "deduplication" step sits in the overall architecture, the more resources it saves for the downstream processing systems. Search engines also generally analyze the duplicate pages they have already crawled, for example to judge whether a site contains a large number of duplicate pages, or whether the site has scraped all of its content from other sites, and then decide whether to keep crawling the site or simply block it from crawling.
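As a purely illustrative sketch of such a crawl decision (this is not Baidu's or any engine's actual policy, and the 80% threshold is an invented value):

```python
def crawl_policy(duplicate_pages: int, total_pages: int, threshold: float = 0.8) -> str:
    """Hypothetical rule: if most of a site's crawled pages duplicate
    content already seen elsewhere, stop crawling the site."""
    ratio = duplicate_pages / total_pages if total_pages else 0.0
    return "block" if ratio >= threshold else "keep crawling"

print(crawl_policy(850, 1000))  # 'block'
print(crawl_policy(120, 1000))  # 'keep crawling'
```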
Deduplication generally happens after word segmentation and before indexing (it may also happen before segmentation). From the keywords already segmented out of a page, the search engine extracts some representative ones and then computes a "fingerprint" from them. Every page gets such a feature fingerprint; when the keyword fingerprint of a newly crawled page overlaps with that of an already indexed page, the new page may be treated as duplicate content by the search engine and dropped from the index.
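A minimal sketch of this idea, assuming frequency-based keyword selection, an MD5 hash, and a 75% overlap threshold (all invented for illustration, not any engine's real algorithm):

```python
import hashlib
from collections import Counter

def keyword_fingerprint(keywords, top_k=8):
    """Pick the top_k most frequent keywords and hash each one;
    the set of hashes serves as the page's 'fingerprint'."""
    top = [w for w, _ in Counter(keywords).most_common(top_k)]
    return {hashlib.md5(w.encode("utf-8")).hexdigest()[:8] for w in top}

def looks_duplicate(fp_new, fp_indexed, threshold=0.75):
    """Flag the new page as a duplicate if enough of its fingerprint
    overlaps an already-indexed page's fingerprint."""
    if not fp_new:
        return False
    return len(fp_new & fp_indexed) / len(fp_new) >= threshold

# Two pages that share most of their representative keywords:
page_a = ["baidu", "links", "crackdown", "selling", "links", "baidu"]
page_b = ["baidu", "links", "selling", "crackdown", "links", "update"]
print(looks_duplicate(keyword_fingerprint(page_a), keyword_fingerprint(page_b)))
```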
In actual practice, search engines not only use word segmentation to separate out meaningful keywords, but also extract keywords by "continuous cutting" for fingerprint calculation. Continuous cutting means sliding a window forward one character at a time, producing overlapping fragments. For example, "百度开始打击买卖链接" ("Baidu starts cracking down on buying and selling links") would be cut into "百度", "度开", "开始", "始打", "打击", "击买", "买卖", "卖链", "链接". Some keywords are then extracted from these fragments for fingerprint calculation and take part in the duplicate-content comparison. This is only the basic algorithm search engines use to identify duplicate pages; there are many other algorithms for handling duplicate pages.
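In Python this kind of continuous cutting is a one-liner; a minimal sketch using the example phrase above:

```python
def shingles(text: str, n: int = 2) -> list[str]:
    """Slide an n-character window forward one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(shingles("百度开始打击买卖链接"))
# ['百度', '度开', '开始', '始打', '打击', '击买', '买卖', '卖链', '链接']
```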
This is why the popular online "pseudo-original" tools cannot deceive search engines; they only turn the content into something no human can read, so in theory a page run through an ordinary pseudo-original tool cannot be normally indexed and ranked. However, Baidu does not simply discard every duplicate page without indexing it; it relaxes the indexing standard according to the weight of the site publishing the duplicate, which lets some cheaters exploit high-weight sites by scraping large amounts of content from other sites to capture search traffic. Since June 2012, though, Baidu has upgraded its search algorithms several times and repeatedly cracked down on scraped duplicate information and spam pages. So when building site content, SEO personnel should approach it not from a "pseudo-original" angle but from the angle of what is useful to the user. Even though the latter kind of content is not necessarily all original, a site with no serious weight problems will generally develop healthily this way. The question of originality is discussed in detail in Chapter 12 of the book.
Moreover, it is not only search engines that need "page deduplication"; people running their own sites also need to deduplicate pages within the site. On sites with UGC, such as classified-information sites and B2B platforms, if the information users publish is not restricted there will inevitably be a great deal of duplication, which not only hurts SEO performance but also greatly degrades the on-site user experience. Similarly, when SEO personnel design traffic products built on "aggregation", such as index pages, topic pages, or directory pages, the core words behind the "aggregation" must be deduplicated; without filtering, the pages expanded from a mass of core words may contain heavy duplication, making the product perform poorly and possibly even getting the site demoted by search engines.
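A minimal sketch of on-site deduplication, assuming a hypothetical publishing flow: before a new user submission (or a new aggregation page's core word) is accepted, its normalized text is hashed and checked against what was already published. This catches only exact duplicates after normalization; the near-duplicate algorithms named below handle reworded content.

```python
import hashlib

seen_fingerprints = set()  # fingerprints of content already published

def normalize(text: str) -> str:
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def try_publish(text: str) -> bool:
    """Reject a submission whose normalized text was already published."""
    fp = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return False  # exact duplicate: skip it
    seen_fingerprints.add(fp)
    return True

print(try_publish("Used truck for sale,  low mileage"))  # True
print(try_publish("used truck for sale, low mileage"))   # False (duplicate)
```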
"Go heavy" the general principle of the algorithm is as described above, interested friends can understand i-match, shingle, simhash and cosine to weight specific algorithm. Search engine to do "page to heavy" work before the first to analyze the page, the "noise" around the content will have an impact on the weight of the results, do this part of the work is only part of the operation on the content can be relatively simple, and can be very effective in supporting the production of high-quality "SEO products." As a SEO staff as long as the realization of the principle can be, the specific application in the product, the need for technical personnel to achieve. In addition, it involves efficiency, resource needs and other issues, and according to the actual situation "to heavy" work can also be carried out in a number of links (such as the core words of the word segment), SEO personnel as long as a little understanding of some of the principles, to be able to recommend a few technical staff is very good (technical personnel are not Not good at the field, at a specific time also need others to provide ideas. If the SEO personnel in these areas and technical personnel to carry out in-depth communication, technical staff will be more than SEO, at least no longer think that "SEO personnel will only modify the title, change the link, change the text, such as ' boring ' demand."
Summary: Thanks again to the author for such a good book. In thinking about SEO, I found new knowledge here: deduplication and the fingerprint principle. I hope brothers and sisters will go read this book; today I have shared just one part of it, and in the days to come more good knowledge will continue to be shared.
Reprint source: http://www.91suichediao.com/