[Crawler Learning Notes] Building a Simhash-based de-duplication module: ContentSeen

Some sites on the internet have mirror sites (mirrors): two web sites whose contents are identical but whose domain names differ. This causes a crawler to fetch the same content repeatedly. To avoid this, every crawled page first enters the ContentSeen module. The module determines whether the page's content is identical to that of a page already downloaded; if so, the page is not passed on to the next processing step. This can significantly cut down the crawler's subsequent processing and storage. As for how to decide whether two pages have the same content, the general idea is this: do not compare the contents of the two pages directly; instead, compute a fingerprint from each page's content. The fingerprint is usually a fixed-length string, much shorter than the page body. If two pages have the same fingerprint, their contents are considered identical.

To build this module, we first need a strong fingerprint algorithm. We compute a fingerprint for every page we keep and store it in a database; when the next page arrives, we compute its fingerprint and compare it against the stored ones before saving, which completes the de-duplication check.
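As a minimal sketch of that workflow (the FingerprintStore class and its in-memory HashSet are illustrative assumptions; a real crawler would back this with the database mentioned above):

using System.Collections.Generic;

// Hypothetical store for fingerprints of pages we have already accepted.
public class FingerprintStore
{
    private readonly HashSet<int> _seen = new HashSet<int>();

    // Returns true if the fingerprint is new, i.e. the page was not seen before.
    public bool TryAdd(int fingerprint)
    {
        return _seen.Add(fingerprint);
    }
}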

First, let's look at simhash, the near-duplicate detection algorithm famously used by Google:

In the paper "Detecting Near-Duplicates for Web Crawling", Google proposed applying Moses Charikar's simhash algorithm to the task of de-duplicating billions of web pages.

Simhash is a locality-sensitive hash (LSH):

The main idea is dimensionality reduction: map a high-dimensional feature vector to a low-dimensional fingerprint, then decide whether two documents are duplicates or near-duplicates from the Hamming distance between the two fingerprints.

About Hamming distance: in information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. Put another way, it is the number of substitutions needed to transform one string into the other. For example, the Hamming distance between 1011101 and 1001001 is 2. The familiar string edit distance is a generalization of Hamming distance.
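For fingerprints stored as 32-bit integers, the Hamming distance is simply the number of set bits in their XOR. A minimal sketch (the HammingDemo class is an illustration; the module below contains its own equivalent helper):

using System;

static class HammingDemo
{
    // Number of differing bits = population count of the XOR.
    static int HammingDistance(int a, int b)
    {
        var x = a ^ b;
        var count = 0;
        while (x != 0)
        {
            count += x & 1;
            x = (int)((uint)x >> 1); // unsigned shift so the sign bit is not duplicated
        }
        return count;
    }

    static void Main()
    {
        // The example from the text: 1011101 vs 1001001 differ in 2 positions.
        Console.WriteLine(HammingDistance(0b1011101, 0b1001001)); // prints 2
    }
}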

Thus, by comparing the Hamming distances between the simhash values of documents, we can measure how similar they are (the Google paper, for instance, treats 64-bit fingerprints that differ in at most 3 bits as near-duplicates).

More details can be found in the linked "Simhash algorithm" article.

_______________________________________________________________________________________________

Now let's implement the code:

using System;
using System.Collections.Generic;
using System.Linq;

namespace Crawler.Common
{
    public class SimhashAnalyser
    {
        // Fingerprints are 32-bit ints, so the maximum Hamming distance is 32.
        private const int HashSize = 32;

        public static float GetLikenessValue(string needle, string haystack, TokeniserType type = TokeniserType.Overlapping)
        {
            var needleSimhash = GetSimhash(needle, type);
            var haystackSimhash = GetSimhash(haystack, type);
            return GetLikenessValue(needleSimhash, haystackSimhash);
        }

        // Similarity = share of bits on which the two fingerprints agree.
        public static float GetLikenessValue(int needleSimhash, int haystackSimhash)
        {
            return (HashSize - GetHammingDistance(needleSimhash, haystackSimhash)) / (float)HashSize;
        }

        // Note: string.GetHashCode is process-specific on modern .NET runtimes;
        // fingerprints that must be stored long-term need a stable hash instead.
        private static IEnumerable<int> DoHashTokens(IEnumerable<string> tokens)
        {
            return tokens.Select(token => token.GetHashCode()).ToList();
        }

        private static int GetHammingDistance(int firstValue, int secondValue)
        {
            var hammingBits = firstValue ^ secondValue;
            var hammingValue = 0;
            for (var i = 0; i < HashSize; i++)
                if (IsBitSet(hammingBits, i))
                    hammingValue += 1;
            return hammingValue;
        }

        private static bool IsBitSet(int b, int pos)
        {
            return (b & (1 << pos)) != 0;
        }

        public static int GetSimhash(string input)
        {
            return GetSimhash(input, TokeniserType.Overlapping);
        }

        public static int GetSimhash(string input, TokeniserType tokeniserType)
        {
            ITokeniser tokeniser;
            if (tokeniserType == TokeniserType.Overlapping)
                tokeniser = new OverlappingStringTokeniser();
            else
                tokeniser = new FixedSizeStringTokeniser();

            var hashedTokens = DoHashTokens(tokeniser.Tokenise(input));

            // Classic simhash: for every bit position, add 1 for each token hash
            // with that bit set, subtract 1 otherwise ...
            var vector = new int[HashSize];
            for (var i = 0; i < HashSize; i++)
                vector[i] = 0;

            foreach (var value in hashedTokens)
                for (var j = 0; j < HashSize; j++)
                {
                    if (IsBitSet(value, j))
                        vector[j] += 1;
                    else
                        vector[j] -= 1;
                }

            // ... then set the fingerprint bit wherever the running total is positive.
            var fingerprint = 0;
            for (var i = 0; i < HashSize; i++)
                if (vector[i] > 0)
                    fingerprint += 1 << i;
            return fingerprint;
        }
    }

    public interface ITokeniser
    {
        IEnumerable<string> Tokenise(string input);
    }

    // Splits the input into fixed-size, non-overlapping chunks.
    public class FixedSizeStringTokeniser : ITokeniser
    {
        private readonly ushort _tokenSize;

        public FixedSizeStringTokeniser(ushort tokenSize = 5)
        {
            if (tokenSize < 2 || tokenSize > 127)
                throw new ArgumentException("Token size out of range");
            _tokenSize = tokenSize;
        }

        public IEnumerable<string> Tokenise(string input)
        {
            var chunks = new List<string>();
            var offset = 0;
            while (offset < input.Length)
            {
                chunks.Add(new string(input.Skip(offset).Take(_tokenSize).ToArray()));
                offset += _tokenSize;
            }
            return chunks;
        }
    }

    // Splits the input into overlapping chunks (a sliding window), which makes
    // the fingerprint more robust to small local edits.
    public class OverlappingStringTokeniser : ITokeniser
    {
        private readonly ushort _chunkSize;
        private readonly ushort _overlapSize;

        public OverlappingStringTokeniser(ushort chunkSize = 4, ushort overlapSize = 3)
        {
            if (chunkSize <= overlapSize)
                throw new ArgumentException("Chunk size must be greater than overlap size");
            _chunkSize = chunkSize;
            _overlapSize = overlapSize;
        }

        public IEnumerable<string> Tokenise(string input)
        {
            var result = new List<string>();
            var position = 0;
            while (position < input.Length - _chunkSize)
            {
                result.Add(input.Substring(position, _chunkSize));
                position += _chunkSize - _overlapSize;
            }
            return result;
        }
    }

    public enum TokeniserType
    {
        Overlapping,
        FixedSize
    }
}

The calling method is as follows:

var " The cat sat on the mat. " ; var " The cat sat on a mat. " ; var similarity = simhashanalyser.getlikenessvalue (s1, S2); Console.clear (); Console.WriteLine (" similarity: {0}%"); Console.readkey ();

The output is:

Similarity: 78.125%

That is, the two fingerprints differ in 7 of 32 bits: (32 - 7) / 32 = 0.78125.

The next step is a simple wrapper for the ContentSeen module:
using Crawler.Common;

namespace Crawler.Processing
{
    /// <summary>
    /// Every crawled page first enters the ContentSeen module, which determines
    /// whether the page's content is identical to that of a page already
    /// downloaded; if so, the page is not passed on to the next processing step.
    /// </summary>
    public class ContentSeen
    {
        public static int GetFingerprint(string html)
        {
            return SimhashAnalyser.GetSimhash(html);
        }

        public static float Similarity(int print1, int print2)
        {
            return SimhashAnalyser.GetLikenessValue(print1, print2);
        }
    }
}
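One possible way to wire ContentSeen into the crawl loop (the seen-fingerprint list and the 0.9 threshold are illustrative assumptions, not values from the original module):

using System.Collections.Generic;
using Crawler.Processing;

public class CrawlLoopSketch
{
    // Fingerprints of pages already accepted (illustrative in-memory store).
    private readonly List<int> _seenFingerprints = new List<int>();

    // Returns true if the page should be sent to the next processing step.
    public bool ShouldProcess(string html)
    {
        var print = ContentSeen.GetFingerprint(html);
        foreach (var seen in _seenFingerprints)
            if (ContentSeen.Similarity(seen, print) >= 0.9f) // assumed near-duplicate threshold
                return false; // too similar to a page we already have
        _seenFingerprints.Add(print);
        return true;
    }
}

A linear scan over all stored fingerprints costs O(n) per page; the Google paper describes index structures that make this lookup fast at billion-page scale.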
