Heritrix Summary: Rewriting HostnameQueueAssignmentPolicy

Source: Internet
Author: User

The previous article covered downloading only the pages under a specified path. However, Heritrix uses the host of each URL as the key of its link queues, so even with 100 threads configured, only one thread actually runs: by default Heritrix takes one URL from a queue at a time and fetches the next one only after that capture is complete. Because the specified path lies almost entirely on a single host, the crawl degenerates into single-threaded crawling, which is very slow.
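The effect described above can be illustrated without Heritrix at all. The following is only a minimal sketch: HostKeyDemo, hostKey and assign are hypothetical names invented for this illustration, and the example.com URLs are placeholders. It groups URLs into queues by their host, the way a host-keyed policy would, and shows that URLs under one path all land in a single queue.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HostKeyDemo {
    // Hypothetical stand-in for a host-based queue key: the URI's host[:port].
    static String hostKey(String url) {
        return URI.create(url).getAuthority();
    }

    // Put each URL into the queue named by its key (one queue per distinct key).
    static Map<String, List<String>> assign(List<String> urls) {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            queues.computeIfAbsent(hostKey(url), k -> new ArrayList<>()).add(url);
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.example.com/docs/a.html",
                "http://www.example.com/docs/b.html",
                "http://www.example.com/docs/c.html");
        Map<String, List<String>> queues = assign(urls);
        // All three URLs share one host, so there is only one queue to pull from.
        System.out.println("queues: " + queues.size()); // prints "queues: 1"
    }
}
```

With only one queue, and one URL in flight per queue at a time, extra worker threads have nothing to pick up.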

With no other option, we rewrite again. This time the target is HostnameQueueAssignmentPolicy, which is also the system's default policy. Previously I subclassed a frontier (replicactfrontier), but after half a day I still had not managed to get multi-threaded downloading to work. It turns out that changing HostnameQueueAssignmentPolicy directly is enough. The key method is getClassKey, which generates the key value for the queue; here the ELFHash algorithm is used to hash the URL.

The code is as follows:

/* HostnameQueueAssignmentPolicy
 *
 * $Id: HostnameQueueAssignmentPolicy.java 3838 23:00:47Z gojomo $
 *
 * Created on Oct 5, 2004
 *
 * Copyright (C) 2004 Internet Archive.
 *
 * This file is part of the Heritrix web crawler (crawler.archive.org).
 *
 * Heritrix is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or
 * any later version.
 *
 * Heritrix is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Lesser Public License for more details.
 *
 * You should have received a copy of the GNU Lesser Public License
 * along with Heritrix; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */
package org.archive.crawler.frontier;

import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;

/**
 * QueueAssignmentPolicy based on the hostname:port evident in the given
 * CrawlURI.
 *
 * @author gojomo
 */
public class HostnameQueueAssignmentPolicy extends QueueAssignmentPolicy {
    private static final Logger logger = Logger
        .getLogger(HostnameQueueAssignmentPolicy.class.getName());

    /**
     * When neat host-based class-key fails us
     */
    private static String DEFAULT_CLASS_KEY = "default...";

    private static final String DNS = "dns";

    // Multi-threaded variant: hash the full URL into one of 100 queue keys
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        String a = Long.toString(hash % 100);
        return a;
    }

    // ELFHash hash algorithm
    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }

    /* Original host-based implementation, kept for reference:
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String scheme = cauri.getUURI().getScheme();
        String candidate = null;
        try {
            if (scheme.equals(DNS)) { // the URI is a DNS lookup
                if (cauri.getVia() != null) {
                    // Special handling for DNS: treat as being
                    // of the same class as the triggering URI.
                    // When a URI includes a port, this ensures
                    // the DNS lookup goes atop the host:port
                    // queue that triggered it, rather than
                    // some other host queue
                    UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
                    candidate = viaUuri.getAuthorityMinusUserinfo();
                    // adopt scheme of triggering URI
                    scheme = viaUuri.getScheme();
                } else {
                    candidate = cauri.getUURI().getReferencedHost();
                }
            } else {
                candidate = cauri.getUURI().getAuthorityMinusUserinfo();
            }

            if (candidate == null || candidate.length() == 0) {
                candidate = DEFAULT_CLASS_KEY;
            }
        } catch (URIException e) {
            logger.log(Level.INFO,
                "unable to extract class key; using default", e);
            candidate = DEFAULT_CLASS_KEY;
        }
        if (scheme != null && scheme.equals(UURIFactory.HTTPS)) {
            // If https and no port specified, add default https port to
            // distinguish https from http server without a port.
            if (!candidate.matches(".+:[0-9]+")) {
                candidate += UURIFactory.HTTPS_PORT;
            }
        }
        // Ensure classKeys are safe as filenames on NTFS
        return candidate.replace(':', '#'); // the key is basically the host name
    }
    */
}
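The hashing step in getClassKey above can be tried in isolation, without a Heritrix installation. The sketch below copies the article's ELFHash logic and the modulus of 100 into a standalone class. ElfHashKeyDemo, classKey and QUEUE_COUNT are hypothetical names chosen for this illustration, and the example.com URLs are placeholders.

```java
import java.util.Set;
import java.util.TreeSet;

class ElfHashKeyDemo {
    // Same modulus as in the article's getClassKey; the name is an assumption.
    static final int QUEUE_COUNT = 100;

    // ELF hash: shift in 4 bits per character and fold any overflow
    // from the top nibble back into the lower bits.
    static long elfHash(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            long x = hash & 0xF0000000L;
            if (x != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFFL;
    }

    // Queue key derived from the whole URL instead of from its host.
    static String classKey(String url) {
        return Long.toString(elfHash(url) % QUEUE_COUNT);
    }

    public static void main(String[] args) {
        Set<String> keys = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(classKey("http://www.example.com/docs/page" + i + ".html"));
        }
        // Every URL is on the same host, yet the keys are spread over several queues.
        System.out.println("distinct queue keys: " + keys.size());
    }
}
```

Because the key no longer depends on the host, URLs from one site end up in up to 100 different queues, which is what lets several threads fetch at the same time. The trade-off is that requests to that one host are no longer serialized through a single queue.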

It turns out that the download speed improved significantly: the crawl is at least no longer single-threaded, and runs at basically around … KB.

That is all for now; writing a crawler from scratch can come later.
