The previous article solved downloading pages under a specified path, but Heritrix uses the host value as the key of its link queues, so even if 100 threads are configured, only one of them is actually working: by default Heritrix takes one URL from a queue and only fetches the next after the current capture finishes. Since everything under the specified path lives on essentially one host, the crawl degenerates into single-threaded fetching, which is very slow.
With no better option, we keep rewriting. This time the change goes into HostnameQueueAssignmentPolicy, which is also the system's default queue-assignment policy. Earlier I had subclassed the Frontier and spent half a day without getting multi-threaded downloading to work; it turns out it is enough to change HostnameQueueAssignmentPolicy directly. The method that matters is getClassKey, which produces the key of the queue a URI is assigned to; here the ELFHash algorithm is used to hash the full URL instead of the hostname.
The code is as follows:
/* HostnameQueueAssignmentPolicy
 *
 * $Id: HostnameQueueAssignmentPolicy.java 3838 23:00:47Z gojomo $
 *
 * Created on Oct 5, 2004
 *
 * Copyright (C) 2004 Internet Archive.
 *
 * This file is part of the Heritrix web crawler (crawler.archive.org).
 *
 * Heritrix is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or
 * any later version.
 *
 * Heritrix is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser Public License for more details.
 *
 * You should have received a copy of the GNU Lesser Public License
 * along with Heritrix; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */
package org.archive.crawler.frontier;

import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;

/**
 * QueueAssignmentPolicy based on the hostname:port evident in the given
 * CrawlURI.
 *
 * @author gojomo
 */
public class HostnameQueueAssignmentPolicy extends QueueAssignmentPolicy {

    private static final Logger logger = Logger
        .getLogger(HostnameQueueAssignmentPolicy.class.getName());

    /** When neat host-based class-key fails us */
    private static String DEFAULT_CLASS_KEY = "default...";

    private static final String DNS = "dns";

    // Multi-threaded version: hash the whole URL so that URIs from one
    // host are spread over up to 100 queues instead of a single queue.
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        String a = Long.toString(hash % 100);
        return a;
    }

    // ELFHash string-hashing algorithm
    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }

    /* The original host-based implementation, kept commented out:
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String scheme = cauri.getUURI().getScheme();
        String candidate = null;
        try {
            if (scheme.equals(DNS)) {
                if (cauri.getVia() != null) {
                    // Special handling for DNS: treat as being
                    // of the same class as the triggering URI.
                    // When a URI includes a port, this ensures
                    // the DNS lookup goes atop the host:port
                    // queue that triggered it, rather than
                    // some other host queue
                    UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
                    candidate = viaUuri.getAuthorityMinusUserinfo();
                    // adopt scheme of triggering URI
                    scheme = viaUuri.getScheme();
                } else {
                    candidate = cauri.getUURI().getReferencedHost();
                }
            } else {
                candidate = cauri.getUURI().getAuthorityMinusUserinfo();
            }
            if (candidate == null || candidate.length() == 0) {
                candidate = DEFAULT_CLASS_KEY;
            }
        } catch (URIException e) {
            logger.log(Level.INFO,
                "unable to extract class key; using default", e);
            candidate = DEFAULT_CLASS_KEY;
        }
        if (scheme != null && scheme.equals(UURIFactory.HTTPS)) {
            // If https and no port specified, add default https port to
            // distinguish https from http server without a port.
            if (!candidate.matches(".+:[0-9]+")) {
                candidate += UURIFactory.HTTPS_PORT;
            }
        }
        // Ensure classKeys are safe as filenames on NTFS
        return candidate.replace(':', '#'); // basically the domain name
    }
    */
}
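To see why this removes the single-queue bottleneck, here is a small standalone check of how the ELFHash key spreads URLs from one host across queues. This is a hypothetical test harness of my own, not part of the Heritrix source, and the sample URLs are made up:

import java.util.HashSet;
import java.util.Set;

// Hypothetical standalone check: with the host-based key every URL below
// maps to one queue, while ELFHash(url) % 100 spreads them over many queues.
public class ELFHashQueueCheck {

    // Same ELFHash as in the modified HostnameQueueAssignmentPolicy
    static long elfHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        Set<String> keys = new HashSet<String>();
        // 1000 made-up URLs that all live on the same host
        for (int i = 0; i < 1000; i++) {
            String url = "http://www.example.com/path/page" + i + ".html";
            keys.add(Long.toString(elfHash(url) % 100));
        }
        // Host-based policy: 1 distinct key. ELFHash policy: close to 100,
        // so up to 100 worker threads can be busy at once.
        System.out.println("distinct queue keys: " + keys.size());
    }
}

Since each queue is served serially, the number of distinct keys bounds how many threads can fetch from one host in parallel, so the modulus (100 here) should be at least as large as the configured thread count.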
It turns out the download speed is markedly better; at least it is no longer single-threaded, and throughput basically holds at the KB level.
That is the introduction for now; building on it, you can later write a crawler of your own.