Similarity Judgment for Crawler-Crawled Web Pages


When a crawler crawls the Web, many problems come up, and one of the most important is duplication: crawling the same page repeatedly. The simplest remedy is URL deduplication: URLs that have already been crawled are not crawled again. In real business scenarios, however, some already-crawled URLs do need to be crawled again. A BBS thread, for example, receives a large number of updates while its URL never changes.

In general, URL deduplication means checking whether a URL has been crawled before: if it has, it is not crawled again at all, or not crawled again within a certain time window.

My requirement is the same, so the first step is URL deduplication. When the crawler discovers a link and adds it to the crawl queue, it checks whether the URL has already been crawled, or whether it is due to be crawled again at the current time; and if the content is then fetched within that window, whether it actually needs to be stored.
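As a sketch of that first step, here is a minimal in-memory URL deduper with a re-crawl window. The class and field names are my own, not from the original project; a production crawler would back this with something persistent such as Redis or a Bloom filter:

```java
import java.util.HashMap;
import java.util.Map;

public class UrlDeduper {
    private final Map<String, Long> lastCrawled = new HashMap<>();
    private final long recrawlIntervalMillis;

    public UrlDeduper(long recrawlIntervalMillis) {
        this.recrawlIntervalMillis = recrawlIntervalMillis;
    }

    // Returns true if the URL should be crawled now: either it has never
    // been seen, or its re-crawl interval has elapsed. Records the crawl.
    public synchronized boolean shouldCrawl(String url, long nowMillis) {
        Long last = lastCrawled.get(url);
        if (last != null && nowMillis - last < recrawlIntervalMillis) {
            return false;
        }
        lastCrawled.put(url, nowMillis);
        return true;
    }
}
```

The clock is passed in as a parameter so the policy is easy to test; in the crawler itself one would pass `System.currentTimeMillis()`.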

This matters because of two business constraints:

1. DM pulls data out of the database on a schedule.

2. ETL also extracts data from the database on a schedule.

Neither of them checks whether a record has already been processed, so the redundancy falls on the crawler side. The crawler fetches at least millions of records per day; without deduplication, the people consuming the DM and ETL output would have to wade through a large volume of already-processed data every day.

Many people handle this with MD5 or some other hash, or even several of them in combination... OK, let me just describe the problem.

Every visit to a page bumps its view counter, so the content change may be one character in a thousand. But to MD5, the whole input has changed, and the entire digest changes with it.

These were the MD5 values I recorded for the original article body, and again after the content changed (screenshots omitted).

After testing, an MD5 digest of the page body has very low applicability for crawler deduplication... so we changed the plan.
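The avalanche behavior is easy to reproduce with the JDK's `MessageDigest`; the page bodies below are made-up stand-ins for the recorded article:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class Md5Avalanche {
    // Hex-encoded MD5 digest of a string, via the standard JDK MessageDigest.
    static String md5Hex(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
        return String.format("%032x", new BigInteger(1, d));
    }

    public static void main(String[] args) throws Exception {
        // Only the view counter differs: one character out of the whole body.
        String before = "Article body ... views: 100";
        String after  = "Article body ... views: 101";
        System.out.println(md5Hex(before));
        System.out.println(md5Hex(after)); // a completely different digest
    }
}
```

Comparing the two hex strings shows almost every position differing, which is exactly why an exact-hash scheme cannot tell a trivial edit from a real update.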

Instead, we use the SimHash algorithm to fingerprint the crawled content.

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class SimHash {

    private String tokens;
    public BigInteger intSimHash;
    private String strSimHash;
    private int hashbits = 64;

    public SimHash(String tokens) {
        this.tokens = tokens;
        this.intSimHash = this.simHash();
    }

    public SimHash(String tokens, int hashbits) {
        this.tokens = tokens;
        this.hashbits = hashbits;
        this.intSimHash = this.simHash();
    }

    public BigInteger simHash() {
        // 1. Define the feature vector array.
        int[] v = new int[this.hashbits];
        // 2. Tokenize; swap in a proper word segmenter if needed.
        StringTokenizer stringTokenizer = new StringTokenizer(tokens);
        while (stringTokenizer.hasMoreTokens()) {
            String temp = stringTokenizer.nextToken();
            // 3. Hash each token and vote its bits into v.
            BigInteger t = this.hash(temp);
            for (int i = 0; i < this.hashbits; i++) {
                BigInteger bitmask = new BigInteger("1").shiftLeft(i);
                if (t.and(bitmask).signum() != 0) {
                    v[i] += 1;
                } else {
                    v[i] -= 1;
                }
            }
        }
        BigInteger fingerprint = new BigInteger("0");
        StringBuffer simHashBuffer = new StringBuffer();
        for (int i = 0; i < this.hashbits; i++) {
            // 4. Reduce the vector: positions >= 0 become 1, the rest 0,
            //    yielding the 64-bit digital fingerprint/signature.
            if (v[i] >= 0) {
                fingerprint = fingerprint.add(new BigInteger("1").shiftLeft(i));
                simHashBuffer.append("1");
            } else {
                simHashBuffer.append("0");
            }
        }
        this.strSimHash = simHashBuffer.toString();
        return fingerprint;
    }

    @SuppressWarnings({ "rawtypes", "unused", "unchecked" })
    public List subByDistance(SimHash simHash, int distance) {
        // Split the fingerprint into (distance + 1) groups for block lookup.
        int numEach = this.hashbits / (distance + 1);
        List characters = new ArrayList();
        StringBuffer buffer = new StringBuffer();
        int k = 0;
        for (int i = 0; i < this.intSimHash.bitLength(); i++) {
            // testBit returns true if and only if the designated bit is set.
            boolean sr = simHash.intSimHash.testBit(i);
            if (sr) {
                buffer.append("1");
            } else {
                buffer.append("0");
            }
            if ((i + 1) % numEach == 0) {
                // Convert this binary group to a BigInteger.
                BigInteger eachValue = new BigInteger(buffer.toString(), 2);
                buffer.delete(0, buffer.length());
                characters.add(eachValue);
            }
        }
        return characters;
    }

    public int getDistance(String str1, String str2) {
        int distance;
        if (str1.length() != str2.length()) {
            distance = -1;
        } else {
            distance = 0;
            for (int i = 0; i < str1.length(); i++) {
                if (str1.charAt(i) != str2.charAt(i)) {
                    distance++;
                }
            }
        }
        return distance;
    }

    /**
     * Hash a single token (a port of CPython's string hash).
     *
     * @param source
     * @return
     */
    private BigInteger hash(String source) {
        if (source == null || source.length() == 0) {
            return new BigInteger("0");
        } else {
            char[] sourceArray = source.toCharArray();
            BigInteger x = BigInteger.valueOf(((long) sourceArray[0]) << 7);
            BigInteger m = new BigInteger("1000003");
            BigInteger mask = new BigInteger("2").pow(this.hashbits).subtract(new BigInteger("1"));
            for (char item : sourceArray) {
                BigInteger temp = BigInteger.valueOf((long) item);
                x = x.multiply(m).xor(temp).and(mask);
            }
            x = x.xor(new BigInteger(String.valueOf(source.length())));
            if (x.equals(new BigInteger("-1"))) {
                x = new BigInteger("-2");
            }
            return x;
        }
    }

    /**
     * Hamming distance between two fingerprints.
     *
     * @param other
     * @return
     */
    public int hammingDistance(SimHash other) {
        BigInteger x = this.intSimHash.xor(other.intSimHash);
        int tot = 0;
        // Count the 1 bits in x. Subtracting 1 from a binary number flips
        // every bit from the lowest set bit down, so n & (n - 1) clears the
        // lowest set bit; counting how many times we can do that gives the
        // number of 1 bits.
        while (x.signum() != 0) {
            tot += 1;
            x = x.and(x.subtract(new BigInteger("1")));
        }
        return tot;
    }
}
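To see the end-to-end effect without the full class, here is a compact, self-contained sketch of the same idea. I swap the token hash for FNV-1a and use a primitive `long` fingerprint, so the exact values differ from the class above, but the near-duplicate behavior is the same:

```java
import java.util.StringTokenizer;

public class SimHashSketch {
    // 64-bit FNV-1a hash of a token (a stand-in for the hash() above).
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // 64-bit simhash: each token votes +1/-1 on every bit position,
    // and the sign of each column becomes one bit of the fingerprint.
    static long simhash(String text) {
        int[] v = new int[64];
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            long t = fnv1a64(st.nextToken());
            for (int i = 0; i < 64; i++) {
                if (((t >>> i) & 1L) != 0) v[i]++; else v[i]--;
            }
        }
        long fp = 0L;
        for (int i = 0; i < 64; i++) {
            if (v[i] >= 0) fp |= (1L << i);
        }
        return fp;
    }

    static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        String a = "the quick brown fox jumps over the lazy dog near the river bank";
        String b = "the quick brown fox jumps over the lazy cat near the river bank";
        String c = "completely different text about database deduplication pipelines";
        System.out.println(hamming(simhash(a), simhash(b))); // typically small: near-duplicates
        System.out.println(hamming(simhash(a), simhash(c))); // much larger: unrelated texts
    }
}
```

Changing one token out of thirteen flips only the bit positions whose vote totals were near zero, so near-duplicates land a short Hamming distance apart, while unrelated texts land around 32 bits apart on average.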

Testing against the same site as before, the results (screenshots omitted):

As you can see, a trivial modification to a page barely changes the resulting hash, while a real content update changes the result directly. That makes SimHash a good fit for deduplication when a crawler re-crawls pages for updates, so I recommend this method for deduplication.
