The previous article solved downloading pages under a specified path, but Heritrix uses the host value as the key of its link queues, so even if 100 threads are configured, only one of them is actually working: by default Heritrix takes one URL from a queue and only fetches the next after the current capture finishes. Since everything under the specified path lives on essentially one host, the crawl degenerates into single-threaded fetching, which is very slow.
With no better option, we keep rewriting. This time the change goes into HostnameQueueAssignmentPolicy, which is also the system's default queue-assignment policy. Earlier I had subclassed the Frontier and spent half a day without getting multi-threaded downloading to work; it turns out it is enough to change HostnameQueueAssignmentPolicy directly. The method that matters is getClassKey, which produces the key of the queue a URI is assigned to; here the ELFHash algorithm is used to hash the full URL instead of the hostname.
The code is as follows:
/* HostnameQueueAssignmentPolicy
 *
 * $Id: HostnameQueueAssignmentPolicy.java 3838 23:00:47Z gojomo $
 *
 * Created on Oct 5, 2004
 *
 * Copyright (C) 2004 Internet Archive.
 *
 * This file is part of the Heritrix web crawler (crawler.archive.org).
 *
 * Heritrix is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or
 * any later version.
 *
 * Heritrix is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser Public License for more details.
 *
 * You should have received a copy of the GNU Lesser Public License
 * along with Heritrix; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */
package org.archive.crawler.frontier;

import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;

/**
 * QueueAssignmentPolicy based on the hostname:port evident in the given
 * CrawlURI.
 *
 * @author gojomo
 */
public class HostnameQueueAssignmentPolicy extends QueueAssignmentPolicy {

    private static final Logger logger = Logger
        .getLogger(HostnameQueueAssignmentPolicy.class.getName());

    /** When neat host-based class-key fails us */
    private static String DEFAULT_CLASS_KEY = "default...";

    private static final String DNS = "dns";

    // Multi-threaded version: hash the whole URL so that URIs from one
    // host are spread over up to 100 queues instead of a single queue.
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        String a = Long.toString(hash % 100);
        return a;
    }

    // ELFHash string-hashing algorithm
    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }

    /* The original host-based implementation, kept commented out:
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String scheme = cauri.getUURI().getScheme();
        String candidate = null;
        try {
            if (scheme.equals(DNS)) {
                if (cauri.getVia() != null) {
                    // Special handling for DNS: treat as being
                    // of the same class as the triggering URI.
                    // When a URI includes a port, this ensures
                    // the DNS lookup goes atop the host:port
                    // queue that triggered it, rather than
                    // some other host queue
                    UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
                    candidate = viaUuri.getAuthorityMinusUserinfo();
                    // adopt scheme of triggering URI
                    scheme = viaUuri.getScheme();
                } else {
                    candidate = cauri.getUURI().getReferencedHost();
                }
            } else {
                candidate = cauri.getUURI().getAuthorityMinusUserinfo();
            }
            if (candidate == null || candidate.length() == 0) {
                candidate = DEFAULT_CLASS_KEY;
            }
        } catch (URIException e) {
            logger.log(Level.INFO,
                "unable to extract class key; using default", e);
            candidate = DEFAULT_CLASS_KEY;
        }
        if (scheme != null && scheme.equals(UURIFactory.HTTPS)) {
            // If https and no port specified, add default https port to
            // distinguish https from http server without a port.
            if (!candidate.matches(".+:[0-9]+")) {
                candidate += UURIFactory.HTTPS_PORT;
            }
        }
        // Ensure classKeys are safe as filenames on NTFS
        return candidate.replace(':', '#'); // basically the domain name
    }
    */
}
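To see why this removes the single-queue bottleneck, here is a small standalone check of how the ELFHash key spreads URLs from one host across queues. This is a hypothetical test harness of my own, not part of the Heritrix source, and the sample URLs are made up:

import java.util.HashSet;
import java.util.Set;

// Hypothetical standalone check: with the host-based key every URL below
// maps to one queue, while ELFHash(url) % 100 spreads them over many queues.
public class ELFHashQueueCheck {

    // Same ELFHash as in the modified HostnameQueueAssignmentPolicy
    static long elfHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        Set<String> keys = new HashSet<String>();
        // 1000 made-up URLs that all live on the same host
        for (int i = 0; i < 1000; i++) {
            String url = "http://www.example.com/path/page" + i + ".html";
            keys.add(Long.toString(elfHash(url) % 100));
        }
        // Host-based policy: 1 distinct key. ELFHash policy: close to 100,
        // so up to 100 worker threads can be busy at once.
        System.out.println("distinct queue keys: " + keys.size());
    }
}

Since each queue is served serially, the number of distinct keys bounds how many threads can fetch from one host in parallel, so the modulus (100 here) should be at least as large as the configured thread count.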
It turns out the download speed is markedly better; at least it is no longer single-threaded, and throughput basically holds at the KB level.
That is the introduction for now; building on it, you can later write a crawler of your own.