Duplicate web content is a real problem for search engines. Duplicate pages mean the engine processes the same content more than once, and worse, the index may end up holding two identical pages, so a query can return duplicate links in the results. Duplicate pages therefore hurt both the search experience and the efficiency and quality of the search system.
Web page duplicate detection grew out of copy detection technology, that is, techniques for determining whether the content of one file plagiarizes or copies one or more other files.
In 1993, Manber of the University of Arizona (now a vice president of engineering at Google) released the SIF tool for finding similar files. In 1995, Sergey Brin (later a Google founder) and Garcia-Molina of Stanford University, working on the digital library project, first proposed a text copy detection mechanism, the COPS (copy protection) system, and its corresponding algorithms [Sergey Brin et al. 1995]. Copy detection was later applied to search engines, and the basic core techniques remain similar.
Web pages differ from plain documents: a web page carries special attributes such as content and formatting tags, so similarity in content and similarity in format give rise to four types of similar web pages.
1. The two pages have identical content and identical format.
2. The two pages have identical content but different formats.
3. The two pages share part of their content, with the same format.
4. The two pages share an important part of their content, but with different formats.
Implementation method:
For duplicate detection, a web page is first reorganized into a document consisting of a title and a body, which makes deduplication easier; for this reason, web page duplicate detection is also called "document duplicate detection". Document duplicate detection generally proceeds in three steps, sketched as a pipeline below:
First, feature extraction.
Second, similarity calculation and evaluation.
Third, duplicate elimination.
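Viewed together, the three steps form a simple pipeline. The sketch below only illustrates that flow in Python; `extract_features`, `similarity`, and the threshold are placeholder assumptions that the concrete algorithms in the following sections would fill in.

```python
def is_duplicate(doc_a, doc_b, extract_features, similarity, threshold=0.9):
    """Three-step duplicate check: extract features from each document,
    compute a similarity score, then decide whether to eliminate one copy.
    The feature extractor, similarity measure, and threshold are
    placeholders; concrete choices (I-Match, shingles/Jaccard) follow below.
    """
    features_a = extract_features(doc_a)        # step 1: feature extraction
    features_b = extract_features(doc_b)
    score = similarity(features_a, features_b)  # step 2: similarity calculation
    return score >= threshold                   # step 3: elimination decision
```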
1. Feature Extraction
When judging similarity, we usually compare invariant features, so the first step of duplicate detection is feature extraction: the document's content is decomposed and represented by a set of features, and this feature set is what the later similarity computation compares.
There are many feature extraction methods; here we focus on two classic algorithms, the I-Match algorithm and the shingle algorithm.
The "I-match algorithm" is not dependent on the complete information analysis, but uses the statistical characteristics of the data set to extract the main features of the document and discard the non main features.
The "shingle algorithm" is used to extract multiple feature words and compare the similarity of two feature sets to achieve document weight checking.
2. Similarity Calculation and Evaluation
After feature extraction, the features must be compared, so the second step of web page duplicate detection is similarity calculation and evaluation.
The I-Match algorithm produces only one feature per document. Given an input document, terms are filtered by their IDF (inverse document frequency): words with a particularly high or low frequency in an article usually do not reflect what the article is about, so the high-frequency and low-frequency words are removed from the document, and a single hash value is computed over what remains (a hash simply maps a data value to an address: the data value is the input, and the computation yields an address value). Documents with the same hash value are considered duplicates.
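As a rough illustration of the I-Match idea, here is a minimal Python sketch. The `idf` table, the cut-off thresholds, and the helper names are illustrative assumptions, not a reference implementation of the published algorithm.

```python
import hashlib

def i_match_fingerprint(text, idf, low_cutoff=1.5, high_cutoff=8.0):
    """Compute a single I-Match-style fingerprint for a document.

    Terms whose IDF falls outside [low_cutoff, high_cutoff] -- i.e. very
    common or very rare words -- are discarded, and one hash is computed
    over the sorted set of remaining terms.  The thresholds and the `idf`
    lookup table are illustrative assumptions.
    """
    terms = {w.lower() for w in text.split()}
    kept = sorted(t for t in terms if low_cutoff <= idf.get(t, 0.0) <= high_cutoff)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

# Two documents are treated as duplicates when their fingerprints match.
idf = {"search": 2.3, "engine": 2.1, "duplicate": 4.7, "the": 0.1, "zyxgbq": 9.9}
doc_a = "the search engine removes duplicate pages"
doc_b = "search engine removes the duplicate pages zyxgbq"
print(i_match_fingerprint(doc_a, idf) == i_match_fingerprint(doc_b, idf))  # True
```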
The shingle algorithm extracts multiple features for comparison, so its processing is more complex. The comparison counts the shingles the two documents have in common, then divides that count by the total number of shingles in the two documents minus the number of shared shingles. The value computed this way is the Jaccard coefficient, which measures the similarity of two sets: the Jaccard coefficient is the size of the intersection of the sets divided by the size of their union.
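A minimal sketch of shingling and the Jaccard coefficient, assuming word-level 3-shingles; the shingle size and tokenization are illustrative choices, not prescribed by the text.

```python
def shingles(text, k=3):
    """Return the set of word-level k-shingles (k consecutive words)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(set_a, set_b):
    """Jaccard coefficient: |A intersect B| / |A union B|.

    Equivalently, shared shingles divided by (total shingles in both
    documents minus the shared ones), as described above.
    """
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
print(round(jaccard(a, b), 2))  # 0.4: the two shingle sets overlap partially
```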
3. Duplicate Elimination
When deleting duplicate content, a search engine could weigh many factors, so the simplest and most practical rule is used: keep the page the crawler fetched first, which also gives high priority to preserving the original web page.
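A hedged sketch of that keep-the-first-crawled-copy rule, assuming each page record carries a content fingerprint and a crawl timestamp; the field names are illustrative.

```python
def eliminate_duplicates(pages):
    """Keep only the earliest-crawled page for each content fingerprint.

    `pages` is assumed to be an iterable of dicts with `fingerprint`,
    `url`, and `crawl_time` fields; these names are illustrative.
    """
    kept = {}
    for page in sorted(pages, key=lambda p: p["crawl_time"]):
        # The first page seen for a fingerprint wins; later copies are dropped.
        kept.setdefault(page["fingerprint"], page)
    return list(kept.values())

pages = [
    {"url": "http://a.example/post", "fingerprint": "abc", "crawl_time": 1},
    {"url": "http://b.example/copy", "fingerprint": "abc", "crawl_time": 5},
    {"url": "http://c.example/other", "fingerprint": "def", "crawl_time": 3},
]
print([p["url"] for p in eliminate_duplicates(pages)])
# ['http://a.example/post', 'http://c.example/other']
```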
Web page duplicate detection is an indispensable part of the system. Deleting duplicate pages spares the other components of the search engine a great deal of unnecessary work: it saves index storage space, reduces query cost, improves the efficiency of PageRank computation, and makes the search engine more convenient for its users.