Next, we analyze the void finished(CrawlURI curi) method of the BdbFrontier class, which completes the end-of-processing handling of a CrawlURI object.
The method is defined in AbstractFrontier, a superclass of BdbFrontier:
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.AbstractFrontier
/**
 * Note that the previously emitted CrawlURI has completed
 * its processing (for now).
 *
 * The CrawlURI may be scheduled to retry, if appropriate,
 * and other related URIs may become eligible for release
 * via the next next() call, as a result of finished().
 *
 * (non-Javadoc)
 * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
 */
public void finished(CrawlURI curi) {
    try {
        KeyedProperties.loadOverridesFrom(curi);
        processFinish(curi);
    } finally {
        KeyedProperties.clearOverridesFrom(curi);
    }
}
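The try/finally in finished() guarantees that the per-URI overrides are cleared even if processFinish() throws. Below is a minimal sketch of that load/work/clear pattern. OverrideScopeSketch, its thread-local stack and the "politenessDelay" key are all hypothetical stand-ins made up for illustration; this is not how KeyedProperties is actually implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical stand-in for the load/clear override pattern; not Heritrix code.
final class OverrideScopeSketch {
    // Per-thread stack of active override maps (an assumption for illustration).
    private static final ThreadLocal<Deque<Map<String, String>>> ACTIVE =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void loadOverrides(Map<String, String> overrides) {
        ACTIVE.get().push(overrides);
    }

    static void clearOverrides() {
        ACTIVE.get().pop();
    }

    // The most recently loaded override map wins; otherwise use the default.
    static String setting(String key, String defaultValue) {
        for (Map<String, String> overrides : ACTIVE.get()) {
            if (overrides.containsKey(key)) {
                return overrides.get(key);
            }
        }
        return defaultValue;
    }

    // Same shape as finished(): load, do the work, always clear.
    static String finished(Map<String, String> uriOverrides, boolean fail) {
        loadOverrides(uriOverrides);
        try {
            if (fail) {
                throw new IllegalStateException("simulated processing failure");
            }
            return setting("politenessDelay", "default");
        } finally {
            clearOverrides();
        }
    }

    public static void main(String[] args) {
        // Inside the scope the override is visible ...
        System.out.println(finished(Map.of("politenessDelay", "5000"), false)); // prints "5000"
        // ... and after the scope it is gone again.
        System.out.println(setting("politenessDelay", "default")); // prints "default"
    }
}
```

The point of the sketch is only the scoping: whatever happens inside the try block, the settings seen by the next URI are not polluted by the previous one.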
finished() then calls the void processFinish(CrawlURI curi) method, which is implemented in WorkQueueFrontier, the parent class of BdbFrontier:
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.WorkQueueFrontier
/**
 * Note that the previously emitted CrawlURI has completed
 * its processing (for now).
 *
 * The CrawlURI may be scheduled to retry, if appropriate,
 * and other related URIs may become eligible for release
 * via the next next() call, as a result of finished().
 *
 * TODO: make as many decisions about what happens to the CrawlURI
 * (success, failure, retry) and queue (retire, snooze, ready) as
 * possible elsewhere, such as in DispositionProcessor. Then, break
 * this into simple branches or focused methods for each case.
 *
 * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
 */
protected void processFinish(CrawlURI curi) {
//    assert Thread.currentThread() == managerThread;
    long now = System.currentTimeMillis();

    curi.incrementFetchAttempts();
    logNonfatalErrors(curi);

    WorkQueue wq = (WorkQueue) curi.getHolder();
    // always refresh budgeting values from current curi
    // (whose overlay settings should be active here)
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());

    assert (wq.peek(this) == curi) : "unexpected peek " + wq;

    int holderCost = curi.getHolderCost();

    if (needsReenqueuing(curi)) {
        // codes/errors which don't consume the URI, leaving it atop queue
        if(curi.getFetchStatus()!=S_DEFERRED) {
            wq.expend(holderCost); // all retries but DEFERRED cost
        }
        long delay_ms = retryDelayFor(curi) * 1000;
        curi.processingCleanup(); // lose state that shouldn't burden retry
        wq.unpeek(curi);
        wq.update(this, curi); // rewrite any changes
        handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
        appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DEFERRED_FOR_RETRY));
        doJournalReenqueued(curi);
        wq.makeDirty();
        return; // no further dequeueing, logging, rescheduling to occur
    }

    // Curi will definitely be disposed of without retry, so remove from queue
    wq.dequeue(this,curi);
    decrementQueuedCount(1);
    largestQueues.update(wq.getClassKey(), wq.getCount());
    log(curi);

    if (curi.isSuccess()) {
        // codes deemed 'success'
        incrementSucceededFetchCount();
        totalProcessedBytes.addAndGet(curi.getRecordedSize());
        appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,SUCCEEDED));
        doJournalFinishedSuccess(curi);
    } else if (isDisregarded(curi)) {
        // codes meaning 'undo' (even though URI was enqueued,
        // we now want to disregard it from normal success/failure tallies)
        // (eg robots-excluded, operator-changed-scope, etc)
        incrementDisregardedUriCount();
        appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DISREGARDED));
        holderCost = 0; // no charge for disregarded URIs
        // TODO: consider reinstating forget-URI capability, so URI could be
        // re-enqueued if discovered again
        doJournalDisregarded(curi);
    } else {
        // codes meaning 'failure'
        incrementFailedFetchCount();
        appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,FAILED));
        // if exception, also send to crawlErrors
        if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
            Object[] array = { curi };
            loggerModule.getRuntimeErrors().log(Level.WARNING, curi.getUURI()
                    .toString(), array);
        }
        // charge queue any extra error penalty
        wq.noteError(getErrorPenaltyAmount());
        doJournalFinishedFailure(curi);
    }

    wq.expend(holderCost); // successes & failures charge cost to queue

    long delay_ms = curi.getPolitenessDelay();
    handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
    wq.makeDirty();

    if(curi.getRescheduleTime()>0) {
        // marked up for forced-revisit at a set time
        curi.processingCleanup();
        curi.resetForRescheduling();
        futureUris.put(curi.getRescheduleTime(),curi);
        futureUriCount.incrementAndGet();
    } else {
        curi.stripToMinimal();
        curi.processingCleanup();
    }
}
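The branch order in processFinish() is easier to see with the side effects stripped away. The sketch below keeps only the decision: which disposition is chosen, and how much the queue is charged. DispositionSketch, Outcome and decide() are hypothetical names, and the boolean flags stand in for needsReenqueuing(curi), the S_DEFERRED check, curi.isSuccess() and isDisregarded(curi). Adding the error penalty to the charge is a simplification of my own; in the source the penalty goes through wq.noteError(), whose internals are not shown in this article.

```java
// Simplified, hypothetical model of the branch order in processFinish(); not Heritrix code.
final class DispositionSketch {
    enum Disposition { DEFERRED_FOR_RETRY, SUCCEEDED, DISREGARDED, FAILED }

    // One decision: the disposition and the amount charged to the queue.
    static final class Outcome {
        final Disposition disposition;
        final int charge;

        Outcome(Disposition disposition, int charge) {
            this.disposition = disposition;
            this.charge = charge;
        }
    }

    static Outcome decide(boolean needsReenqueuing, boolean deferred,
                          boolean success, boolean disregarded,
                          int holderCost, int errorPenalty) {
        if (needsReenqueuing) {
            // Retry: the URI stays in its queue; every retry except DEFERRED costs.
            return new Outcome(Disposition.DEFERRED_FOR_RETRY, deferred ? 0 : holderCost);
        }
        if (success) {
            return new Outcome(Disposition.SUCCEEDED, holderCost);
        }
        if (disregarded) {
            // Disregarded URIs are not charged.
            return new Outcome(Disposition.DISREGARDED, 0);
        }
        // Failure: the normal cost plus the extra error penalty (simplified).
        return new Outcome(Disposition.FAILED, holderCost + errorPenalty);
    }

    public static void main(String[] args) {
        Outcome o = decide(false, false, false, false, 1, 2);
        System.out.println(o.disposition + " charge=" + o.charge); // prints "FAILED charge=3"
    }
}
```

Note that the retry check comes first: a URI that needs re-enqueuing never reaches the success/disregarded/failure branches, which matches the early return in the source.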
First, processFinish() obtains the holder attribute of the CrawlURI curi object (this is the BdbWorkQueue object that corresponds to the classKey value of the CrawlURI curi object; it involves the scheduling of the Heritrix 3.1.0 work queues, which will be analyzed later).
When the URI is not going to be retried, it then calls the synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) method of the BdbWorkQueue object, which is defined in its parent class WorkQueue:
org.archive.crawler.frontier.BdbWorkQueue
org.archive.crawler.frontier.WorkQueue
/**
 * Remove the peekItem from the queue and adjusts the count.
 *
 * @param frontier  Work queues manager.
 */
protected synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) {
    try {
        deleteItem(frontier, peekItem);
    } catch (IOException e) {
        //FIXME better exception handling
        e.printStackTrace();
        throw new RuntimeException(e);
    }
    unpeek(expected);
    count--;
    lastDequeueTime = System.currentTimeMillis();
}
dequeue() in turn calls deleteItem(), which is implemented in BdbWorkQueue:
org.archive.crawler.frontier.BdbWorkQueue
protected void deleteItem(final WorkQueueFrontier frontier,
        final CrawlURI peekItem) throws IOException {
    try {
        final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                .getWorkQueues();
        queues.delete(peekItem);
    } catch (DatabaseException e) {
        throw new IOException(e);
    }
}
Finally, deleteItem() calls the void delete(CrawlURI item) method of the BdbMultipleWorkQueues object. That method was covered in the previous article and is not repeated here.
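To make the peek/dequeue bookkeeping concrete, here is a small in-memory sketch of the same steps: delete the peeked item from the shared store, forget the peek, and decrement the count. WorkQueueSketch is a hypothetical class; the TreeMap stands in for the BDB database shared by all queues, and the key-prefix scheme is an assumption for illustration rather than the real BdbMultipleWorkQueues key layout.

```java
import java.util.TreeMap;

// Hypothetical in-memory model of peek/dequeue on one work queue; not Heritrix code.
final class WorkQueueSketch {
    private final TreeMap<String, String> store; // stand-in for the shared database
    private final String prefix;                 // identifies this queue's keys (assumption)
    private String peekKey;                      // key of the item currently handed out
    private long count;

    WorkQueueSketch(TreeMap<String, String> store, String prefix) {
        this.store = store;
        this.prefix = prefix;
    }

    synchronized void enqueue(String ordinal, String uri) {
        store.put(prefix + ordinal, uri);
        count++;
    }

    // Return the head item without removing it, remembering which key it was.
    synchronized String peek() {
        if (peekKey == null) {
            String key = store.ceilingKey(prefix);
            if (key != null && key.startsWith(prefix)) {
                peekKey = key;
            }
        }
        return peekKey == null ? null : store.get(peekKey);
    }

    // Same order as the source: delete the peeked item, unpeek, adjust the count.
    synchronized void dequeue(String expected) {
        if (peekKey == null || !expected.equals(store.get(peekKey))) {
            throw new IllegalStateException("unexpected peek");
        }
        store.remove(peekKey);
        peekKey = null;
        count--;
    }

    synchronized long getCount() {
        return count;
    }

    public static void main(String[] args) {
        TreeMap<String, String> store = new TreeMap<>();
        WorkQueueSketch queue = new WorkQueueSketch(store, "example.org#");
        queue.enqueue("0001", "http://example.org/a");
        queue.enqueue("0002", "http://example.org/b");
        queue.dequeue(queue.peek());
        System.out.println(queue.peek() + " count=" + queue.getCount());
        // prints "http://example.org/b count=1"
    }
}
```

As in the source, dequeue() only works on the item that was previously peeked, which is why processFinish() asserts that wq.peek(this) == curi before going any further.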
---------------------------------------------------------------------------
This Heritrix 3.1.0 source code analysis series is the author's original work.
If you repost it, please credit the source: Blog Garden (cnblogs), Hedgehog Gentle
Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025419.html