[Crawler learning notes] Building a SimHash-based deduplication module (ContentSeen)



Some websites on the Internet have mirror sites: two sites whose pages have identical content but different domain names. Mirrors cause a crawler to download the same page repeatedly. To avoid this, every fetched page is first passed through the ContentSeen module, which checks whether the page's content is identical to that of a previously downloaded page; if it is, the page is not forwarded for further processing. This check significantly reduces the number of pages the crawler has to download. The general approach to deciding whether two pages have the same content is not to compare the page bodies directly, but to compute a fingerprint (FingerPrint) of each page's content. A fingerprint is typically a fixed-length string, much shorter than the page body; if two pages have the same fingerprint, they are considered to have the same content.

To build this module, we first need a robust fingerprint algorithm: compute a fingerprint of each downloaded page's content and store it in a database. The next time a page is fetched, compare its fingerprint against the stored fingerprints to detect duplicates.
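As a language-neutral sketch of that check-then-store flow (the `fingerprint` function below is only a stand-in for the real SimHash described later, and the in-memory set stands in for the database):

```python
def fingerprint(content: str) -> int:
    # Stand-in fingerprint; the real module uses SimHash (see below).
    # Python's built-in hash is salted per process, but is stable within one run.
    return hash(content) & 0xFFFFFFFF

class ContentSeen:
    def __init__(self):
        self.seen = set()  # in a real crawler this would be a database table

    def is_duplicate(self, content: str) -> bool:
        fp = fingerprint(content)
        if fp in self.seen:
            return True       # fingerprint already stored: skip further processing
        self.seen.add(fp)     # new fingerprint: remember it for next time
        return False

cs = ContentSeen()
print(cs.is_duplicate("<html>mirror page</html>"))  # False: first time seen
print(cs.is_duplicate("<html>mirror page</html>"))  # True: duplicate content
```

Exact-match fingerprints only catch byte-identical pages; the point of SimHash below is to also catch near-duplicates.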

   

First, let's take a look at the SimHash algorithm that Google famously uses for web page deduplication:

In the Google paper "Detecting Near-Duplicates for Web Crawling", Moses Charikar's SimHash algorithm is applied to the task of deduplicating billions of web pages.

SimHash is a type of locality-sensitive hash (LSH):

Its main idea is dimensionality reduction: map a high-dimensional feature vector down to a low-dimensional fingerprint, then use the Hamming distance between two fingerprints to decide whether two documents are duplicates or highly similar.

In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ. Equivalently, it is the minimum number of substitutions required to turn one string into the other. For example, the Hamming distance between 1011101 and 1001001 is 2. (Note that this is not the same as general string edit distance, which also allows insertions and deletions; Hamming distance is the special case of substitutions only, on equal-length strings.)
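For integer fingerprints, the Hamming distance can be computed by XOR-ing the two values and counting the set bits. A quick Python illustration using the example from the text:

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR leaves a 1 bit exactly where the two values differ,
    # so the distance is the popcount of a ^ b.
    return bin(a ^ b).count("1")

# The example above: 1011101 vs 1001001 differ in two positions.
print(hamming_distance(0b1011101, 0b1001001))  # 2
```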

In this way, we can obtain a similarity measure for multiple documents by comparing the Hamming distance between their SimHash fingerprints.
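To make the dimension-reduction step concrete, here is a minimal Python sketch of a 32-bit SimHash over word tokens. It is only illustrative: the C# implementation below uses overlapping character n-gram tokens and .NET's `GetHashCode`, whereas this sketch uses whole words hashed with `zlib.crc32` (chosen here because it is deterministic).

```python
import zlib

HASH_SIZE = 32

def simhash(text: str) -> int:
    # One counter per output bit.
    vector = [0] * HASH_SIZE
    for token in text.split():                  # word tokens; n-grams also work
        h = zlib.crc32(token.encode("utf-8"))   # deterministic 32-bit token hash
        for i in range(HASH_SIZE):
            vector[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the sign of the corresponding counter.
    return sum(1 << i for i in range(HASH_SIZE) if vector[i] > 0)

def similarity(fp1: int, fp2: int) -> float:
    distance = bin(fp1 ^ fp2).count("1")        # Hamming distance
    return (HASH_SIZE - distance) / HASH_SIZE

a = simhash("the cat sat on the mat")
b = simhash("the cat sat on a mat")
print(similarity(a, b))  # in [0, 1]; near-duplicate texts score high
```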

For more detail, refer to the SimHash papers themselves.

_______________________________________________________________________________________________

   

The code is as follows:

using System;
using System.Collections.Generic;
using System.Linq;

namespace Crawler.Common
{
    public class SimHashAnalyser
    {
        private const int HashSize = 32;

        public static float GetLikenessValue(string needle, string haystack, TokeniserType type = TokeniserType.Overlapping)
        {
            var needleSimHash = GetSimHash(needle, type);
            var hayStackSimHash = GetSimHash(haystack, type);
            return GetLikenessValue(needleSimHash, hayStackSimHash);
        }

        public static float GetLikenessValue(int needleSimHash, int hayStackSimHash)
        {
            return (HashSize - GetHammingDistance(needleSimHash, hayStackSimHash)) / (float)HashSize;
        }

        private static IEnumerable<int> DoHashTokens(IEnumerable<string> tokens)
        {
            return tokens.Select(token => token.GetHashCode()).ToList();
        }

        private static int GetHammingDistance(int firstValue, int secondValue)
        {
            var hammingBits = firstValue ^ secondValue;
            var hammingValue = 0;
            for (var i = 0; i < 32; i++)
                if (IsBitSet(hammingBits, i))
                    hammingValue += 1;
            return hammingValue;
        }

        private static bool IsBitSet(int b, int pos)
        {
            return (b & (1 << pos)) != 0;
        }

        public static int GetSimHash(string input)
        {
            return GetSimHash(input, TokeniserType.Overlapping);
        }

        public static int GetSimHash(string input, TokeniserType tokeniserType)
        {
            ITokeniser tokeniser;
            if (tokeniserType == TokeniserType.Overlapping)
                tokeniser = new OverlappingStringTokeniser();
            else
                tokeniser = new FixedSizeStringTokeniser();

            var hashedTokens = DoHashTokens(tokeniser.Tokenise(input));
            var vector = new int[HashSize];
            for (var i = 0; i < HashSize; i++)
                vector[i] = 0;
            foreach (var value in hashedTokens)
                for (var j = 0; j < HashSize; j++)
                    if (IsBitSet(value, j))
                        vector[j] += 1;
                    else
                        vector[j] -= 1;
            var fingerprint = 0;
            for (var i = 0; i < HashSize; i++)
                if (vector[i] > 0)
                    fingerprint += 1 << i;
            return fingerprint;
        }
    }

    public interface ITokeniser
    {
        IEnumerable<string> Tokenise(string input);
    }

    public class FixedSizeStringTokeniser : ITokeniser
    {
        private readonly ushort _tokensize;

        public FixedSizeStringTokeniser(ushort tokenSize = 5)
        {
            if (tokenSize < 2)
                throw new ArgumentException("Token size out of range");
            if (tokenSize > 127)
                throw new ArgumentException("Token size out of range");
            _tokensize = tokenSize;
        }

        public IEnumerable<string> Tokenise(string input)
        {
            var chunks = new List<string>();
            var offset = 0;
            while (offset < input.Length)
            {
                chunks.Add(new string(input.Skip(offset).Take(_tokensize).ToArray()));
                offset += _tokensize;
            }
            return chunks;
        }
    }

    public class OverlappingStringTokeniser : ITokeniser
    {
        private readonly ushort _chunkSize;
        private readonly ushort _overlapSize;

        public OverlappingStringTokeniser(ushort chunkSize = 4, ushort overlapSize = 3)
        {
            if (chunkSize <= overlapSize)
                throw new ArgumentException("Chunk size must be greater than overlap size");
            _overlapSize = overlapSize;
            _chunkSize = chunkSize;
        }

        public IEnumerable<string> Tokenise(string input)
        {
            var result = new List<string>();
            var position = 0;
            while (position < input.Length - _chunkSize)
            {
                result.Add(input.Substring(position, _chunkSize));
                position += _chunkSize - _overlapSize;
            }
            return result;
        }
    }

    public enum TokeniserType
    {
        Overlapping,
        FixedSize
    }
}

 

It is called as follows:

var s1 = "the cat sat on the mat.";
var s2 = "the cat sat on a mat.";
var similarity = SimHashAnalyser.GetLikenessValue(s1, s2);
Console.Clear();
Console.WriteLine("Similarity: {0}%", similarity * 100);
Console.ReadKey();

 

Output:

Similarity: 78.125%
  
The following is a simple encapsulation of the ContentSeen module:
using Crawler.Common;

namespace Crawler.Processing
{
    /// <summary>
    /// Every fetched page first enters the ContentSeen module, which checks whether
    /// the page's content is consistent with that of a previously downloaded page;
    /// if it is, the page is not forwarded for further processing.
    /// </summary>
    public class ContentSeen
    {
        public static int GetFingerPrint(string html)
        {
            return SimHashAnalyser.GetSimHash(html);
        }

        public static float Similarity(int print1, int print2)
        {
            return SimHashAnalyser.GetLikenessValue(print1, print2);
        }
    }
}
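In practice the crawler would compare each new fingerprint against the stored ones and treat anything above some similarity threshold as a near-duplicate. The threshold value below (0.9) is chosen purely for illustration, not taken from the original module; a language-neutral Python sketch of the decision over 32-bit fingerprints:

```python
HASH_SIZE = 32
THRESHOLD = 0.9  # illustrative cutoff, not from the original module

def similarity(fp1: int, fp2: int) -> float:
    # Same formula as GetLikenessValue: (HashSize - HammingDistance) / HashSize.
    distance = bin(fp1 ^ fp2).count("1")
    return (HASH_SIZE - distance) / HASH_SIZE

def is_near_duplicate(new_fp: int, stored_fps) -> bool:
    # A page is a near-duplicate if it is close enough to any stored fingerprint.
    return any(similarity(new_fp, fp) >= THRESHOLD for fp in stored_fps)

stored = [0b10110011101100111011001110110011]
print(is_near_duplicate(0b10110011101100111011001110110001, stored))  # True: distance 1
print(is_near_duplicate(0b01001100010011000100110001001100, stored))  # False: distance 32
```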
