Abstract: Relevant statistics show that near-duplicate pages account for as much as 29% of all Web pages on the Internet, and exact duplicates account for about 22%. Research shows that in a large information acquisition system, 30% of the pages are complete or near duplicates of the other 70% of the pages.
In other words, a very high proportion of Web pages have content that is similar or exactly the same!
Types of page duplication produced when the search crawler crawls:
1. Multiple URL addresses pointing to the same Web page, and mirror sites
for example, www.sina.com and www.sina.com.cn point to the same site.
2. Duplicate or near-duplicate Web page content
for example, plagiarized articles, reposted content, spam, etc.
Two applications of near-duplicate detection for Web content:
One: the user search phase
The goal is to locate near-duplicate documents in the existing index and sort the results returned for the user's query accordingly.
Second: the crawler discovery stage
For a newly crawled page, the crawler runs a near-duplicate detection algorithm and ultimately decides whether or not to index it.
Near-duplicate page types: according to the combination of article content and page layout format, they are divided into 4 forms:
One: the two documents differ in neither content nor layout format; this is called a fully duplicated page.
Two: the two documents have the same content but different layout formats; this is called a content-duplicated page.
Three: the two documents share part of their content and have the same layout format; this is called a layout-duplicated page.
Four: the two documents share part of their important content but have different layout formats; this is called a partially duplicated page.
The negative impact of duplicated pages on search engines:
Under normal circumstances, highly similar page content provides users with little or no new information, yet crawling, indexing, and searching over these pages consume a large amount of server resources.
The benefits of duplicated pages for search engines:
If a page is duplicated many times, this often reflects that its content is popular and indicates that the page is relatively important. It should be indexed with priority, and when users search it should also be given a higher weight in the output ranking.
How duplicate documents are handled:
1. Delete
2. Group duplicate documents
Search engine near-duplicate detection process:
SimHash document fingerprint calculation method:
1. A document is represented by extracting a set of weighted features from it. For example, suppose the features are the document's words and each word's weight is its term frequency (TF).
2. For each word, a hash algorithm generates an n-bit binary value (usually 64 bits or more); the example here uses 8-bit values for illustration. Each word corresponds to a different binary value.
3. An n-dimensional (8-dimensional in the example) vector v is computed dimension by dimension: if the corresponding bit of the word's hash value is 1, the word's weight is added to that dimension; if the bit is 0, the weight is subtracted. The vector is updated in this way for every word.
4. After all words have been processed as above, if the i-th dimension of v is positive, the i-th bit of the n-bit fingerprint is set to 1, otherwise to 0. A code sketch follows this list.
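The following is a minimal Python sketch of this fingerprint calculation. The whitespace tokenizer, the term-frequency weights, and the use of MD5 as the per-word hash are illustrative assumptions made here; a real search engine would substitute its own feature extraction and hash function.

```python
import hashlib
from collections import Counter

def simhash(text, n_bits=64):
    """Minimal SimHash sketch: TF-weighted words hashed with MD5 (assumed choices)."""
    # Step 1: extract features (words) with term-frequency weights.
    weights = Counter(text.lower().split())

    # Steps 2-3: hash each word to n bits and accumulate a weighted vector v.
    v = [0] * n_bits
    for word, weight in weights.items():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(n_bits):
            if (h >> i) & 1:      # bit is 1 -> add the word's weight
                v[i] += weight
            else:                 # bit is 0 -> subtract the word's weight
                v[i] -= weight

    # Step 4: positive dimensions become 1-bits of the fingerprint, the rest 0.
    fingerprint = 0
    for i in range(n_bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```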
Jaccard method of similarity calculation:
Let A and B denote two sets, and let C denote the part they have in common. Set A contains 5 elements, set B contains 4 elements, and they share 2 elements, i.e., the size of C is 2. The Jaccard coefficient is the proportion of shared elements among all distinct elements of the two sets.
In this example, sets A and B have 7 distinct elements in total and share 2 of them, so the similarity of A and B is 2/7 (a code sketch follows).
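A minimal Python sketch of the Jaccard calculation, using made-up element names (w1 ... w7) chosen here only to reproduce the 2/7 example above:

```python
def jaccard(a, b):
    """Jaccard similarity: shared elements divided by total distinct elements."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# |A| = 5, |B| = 4, 2 shared elements, 7 distinct elements in total -> 2/7.
A = {"w1", "w2", "w3", "w4", "w5"}
B = {"w1", "w2", "w6", "w7"}
print(jaccard(A, B))   # 0.2857... == 2/7
```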
In practical applications, the features of sets A and B are hashed into n-bit binary values (64 bits or even more), so that the similarity of A and B is converted into a comparison of binary values, known as a "Hamming distance" comparison. The Hamming distance of two equal-length binary values (e.g., 64-bit) is the number of positions at which their bits differ.
For a given document A, suppose the binary value after feature extraction and hash fingerprinting is: 1 0 0 0 0 0 1 0
For a given document B, suppose the binary value after feature extraction and hash fingerprinting is: 0 0 1 0 0 0 0 1
Comparing them, the values at the 1st, 3rd, 7th, and 8th positions of documents A and B differ, so the Hamming distance is 4. The more positions at which the two fingerprints differ, the greater the Hamming distance; and the greater the Hamming distance, the less similar the two documents are (conversely, the smaller the distance, the more similar they are).
Different search engines may use different Hamming distance thresholds to decide whether the content of two Web pages is a near duplicate. For 64-bit fingerprints, a Hamming distance <= 3 is generally considered a reasonable criterion for judging near duplication, as sketched below.
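A minimal Python sketch of the Hamming-distance comparison between two fingerprints, including the <= 3 threshold mentioned above; the function names and the 8-bit example values are illustrative assumptions:

```python
def hamming_distance(fp_a, fp_b):
    """Number of bit positions at which two fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")

def is_near_duplicate(fp_a, fp_b, threshold=3):
    """Treat two fingerprints as near-duplicates if they differ in at most `threshold` bits."""
    return hamming_distance(fp_a, fp_b) <= threshold

# The 8-bit example from the text: 10000010 vs 00100001 differ in 4 positions.
print(hamming_distance(0b10000010, 0b00100001))          # 4
print(is_near_duplicate(0b10000010, 0b00100001))          # False, 4 > 3
```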