Many sites on the Internet have mirrors (mirror sites): two Web sites whose content is identical but whose domain names differ. This causes a web crawler to crawl the same content repeatedly. To avoid this, every crawled page first passes through a Content Seen module. The module determines whether the page's content is identical to that of a page already downloaded; if so, the page is not sent on to the next processing step. This can significantly reduce the number of pages the crawler needs to download. As for deciding whether two pages have the same content, the general idea is this: do not compare the content of the two pages directly, but compute a fingerprint from each page's content. A fingerprint is usually a fixed-length string, much shorter than the page body; if two pages have the same fingerprint, their content is considered identical.
To build this module, we first need a strong fingerprint algorithm: compute a fingerprint from each page's content and save it to the database, and the next time a page arrives, compare its fingerprint against the stored ones before saving. The de-duplication check then reduces to a fingerprint comparison.
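The overall flow is easy to sketch. The following is only an illustration, not the module itself: FingerprintStore and HasSeen are hypothetical names, and the in-memory HashSet stands in for the database mentioned above.

using System.Collections.Generic;

// A hypothetical in-memory fingerprint store; the text above keeps fingerprints
// in a database, so treat this HashSet as a stand-in for that table.
public class FingerprintStore
{
    private readonly HashSet<int> _seen = new HashSet<int>();

    // Returns true if this exact fingerprint has been recorded before;
    // otherwise records it and returns false.
    public bool HasSeen(int fingerprint)
    {
        if (_seen.Contains(fingerprint))
            return true;
        _seen.Add(fingerprint);
        return false;
    }
}

Exact-match lookup like this only catches identical fingerprints; the point of Simhash, described next, is that near-duplicate pages produce fingerprints that are close to each other, so similarity can be measured as well.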
Let's first look at Simhash, the well-known Web de-duplication algorithm used by Google:
The paper "Detecting Near-Duplicates for Web Crawling", published by Google and building on Moses Charikar's simhash, proposes using the simhash algorithm specifically to solve the de-duplication task at the scale of billions of pages.
Simhash is a locality-sensitive hash (LSH):
Its main idea is dimensionality reduction: map a high-dimensional feature vector (for example, the set of token hashes of a document) to a low-dimensional fingerprint, then judge whether two documents are duplicates or near-duplicates by the Hamming distance between their fingerprints.
Here, the Hamming distance between two strings of equal length is, in information theory, the number of positions at which the corresponding characters differ; equivalently, it is the minimum number of substitutions needed to transform one string into the other. For example, the Hamming distance between 1011101 and 1001001 is 2. The familiar string edit distance is a generalization of Hamming distance (it additionally allows insertions and deletions).
Thus, by comparing the Hamming distances between the simhash values of documents, we obtain their degree of similarity.
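To make the definition concrete, here is a minimal, self-contained sketch (separate from the module below) that computes the Hamming distance of two 32-bit values with XOR and bit counting, the same approach the implementation below takes:

using System;

static class HammingDemo
{
    // Hamming distance between two 32-bit fingerprints: XOR them, then count set bits.
    static int HammingDistance(int a, int b)
    {
        var x = a ^ b;
        var count = 0;
        while (x != 0)
        {
            count += x & 1;
            x = (int)((uint)x >> 1); // logical shift so a set sign bit cannot loop forever
        }
        return count;
    }

    static void Main()
    {
        // 1011101 and 1001001 from the example above differ in two positions.
        Console.WriteLine(HammingDistance(0b1011101, 0b1001001)); // prints 2
    }
}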
More details can be found in the Simhash paper cited above.
Let's implement the code:
using System;
using System.Collections.Generic;
using System.Linq;

namespace Crawler.Common
{
    public class SimhashAnalyser
    {
        // Fingerprints are 32-bit ints, so the hash size is 32.
        private const int HashSize = 32;

        public static float GetLikenessValue(string needle, string haystack, TokeniserType type = TokeniserType.Overlapping)
        {
            var needleSimhash = GetSimhash(needle, type);
            var haystackSimhash = GetSimhash(haystack, type);
            return GetLikenessValue(needleSimhash, haystackSimhash);
        }

        public static float GetLikenessValue(int needleSimhash, int haystackSimhash)
        {
            return (HashSize - GetHammingDistance(needleSimhash, haystackSimhash)) / (float)HashSize;
        }

        private static IEnumerable<int> DoHashTokens(IEnumerable<string> tokens)
        {
            return tokens.Select(token => token.GetHashCode()).ToList();
        }

        private static int GetHammingDistance(int firstValue, int secondValue)
        {
            // XOR leaves a set bit at every position where the two values differ.
            var hammingBits = firstValue ^ secondValue;
            var hammingValue = 0;
            for (var i = 0; i < HashSize; i++)
                if (IsBitSet(hammingBits, i))
                    hammingValue += 1;
            return hammingValue;
        }

        private static bool IsBitSet(int b, int pos)
        {
            return (b & (1 << pos)) != 0;
        }

        public static int GetSimhash(string input)
        {
            return GetSimhash(input, TokeniserType.Overlapping);
        }

        public static int GetSimhash(string input, TokeniserType tokeniserType)
        {
            ITokeniser tokeniser;
            if (tokeniserType == TokeniserType.Overlapping)
                tokeniser = new OverlappingStringTokeniser();
            else
                tokeniser = new FixedSizeStringTokeniser();

            var hashedTokens = DoHashTokens(tokeniser.Tokenise(input));

            // For each bit position, count +1 for every token hash with that bit set
            // and -1 for every token hash with it clear.
            var vector = new int[HashSize];
            for (var i = 0; i < HashSize; i++)
                vector[i] = 0;

            foreach (var value in hashedTokens)
                for (var j = 0; j < HashSize; j++)
                    if (IsBitSet(value, j))
                        vector[j] += 1;
                    else
                        vector[j] -= 1;

            // The fingerprint's bit is 1 wherever the counter ended up positive.
            var fingerprint = 0;
            for (var i = 0; i < HashSize; i++)
                if (vector[i] > 0)
                    fingerprint += 1 << i;
            return fingerprint;
        }
    }

    public interface ITokeniser
    {
        IEnumerable<string> Tokenise(string input);
    }

    public class FixedSizeStringTokeniser : ITokeniser
    {
        private readonly ushort _tokenSize;

        public FixedSizeStringTokeniser(ushort tokenSize = 5)
        {
            if (tokenSize < 2)
                throw new ArgumentException("Token size cannot be out of range");
            if (tokenSize > 127)
                throw new ArgumentException("Token size cannot be out of range");
            _tokenSize = tokenSize;
        }

        // Splits the input into consecutive, non-overlapping chunks.
        public IEnumerable<string> Tokenise(string input)
        {
            var chunks = new List<string>();
            var offset = 0;
            while (offset < input.Length)
            {
                chunks.Add(new string(input.Skip(offset).Take(_tokenSize).ToArray()));
                offset += _tokenSize;
            }
            return chunks;
        }
    }

    public class OverlappingStringTokeniser : ITokeniser
    {
        private readonly ushort _chunkSize;
        private readonly ushort _overlapSize;

        public OverlappingStringTokeniser(ushort chunkSize = 4, ushort overlapSize = 3)
        {
            if (chunkSize <= overlapSize)
                throw new ArgumentException("Chunk size must be greater than overlap size");
            _overlapSize = overlapSize;
            _chunkSize = chunkSize;
        }

        // Slides a window of _chunkSize characters across the input,
        // advancing by (_chunkSize - _overlapSize) each step.
        public IEnumerable<string> Tokenise(string input)
        {
            var result = new List<string>();
            var position = 0;
            while (position < input.Length - _chunkSize)
            {
                result.Add(input.Substring(position, _chunkSize));
                position += _chunkSize - _overlapSize;
            }
            return result;
        }
    }

    public enum TokeniserType
    {
        Overlapping,
        FixedSize
    }
}
It can be called as follows:
var " The cat sat on the mat. " ; var " The cat sat on a mat. " ; var similarity = simhashanalyser.getlikenessvalue (s1, S2); Console.clear (); Console.WriteLine (" similarity: {0}%"); Console.readkey ();
The output is:
Similarity: 78.125%
The next step is a simple wrapper for the ContentSeen module:
using Crawler.Common;

namespace Crawler.Processing
{
    /// <summary>
    /// Every crawled page first passes through the ContentSeen module. The module determines
    /// whether the page's content is identical to that of a page already downloaded; if so,
    /// the page is not sent on to the next processing step.
    /// </summary>
    public class ContentSeen
    {
        public static int GetFingerprint(string html)
        {
            return SimhashAnalyser.GetSimhash(html);
        }

        public static float Similarity(int print1, int print2)
        {
            return SimhashAnalyser.GetLikenessValue(print1, print2);
        }
    }
}
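Putting the pieces together, a crawl pipeline might use this wrapper as shown below. This is only a sketch: the 0.95 similarity threshold and the seenPrints list are assumptions for illustration, not part of the module above.

using System;
using System.Collections.Generic;
using System.Linq;
using Crawler.Processing;

class Pipeline
{
    static void Main()
    {
        var seenPrints = new List<int>();   // fingerprints of pages already downloaded
        const float threshold = 0.95f;      // assumed similarity above which pages count as duplicates

        var html = "<html><body>The cat sat on the mat.</body></html>";
        var print = ContentSeen.GetFingerprint(html);

        var isDuplicate = seenPrints.Any(p => ContentSeen.Similarity(p, print) >= threshold);
        if (!isDuplicate)
        {
            seenPrints.Add(print);
            Console.WriteLine("New content, passing the page to the next stage.");
            // ... hand the page off to the next processing step here
        }
    }
}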