Heritrix 3.1.0 source code parsing (13)

Next, we analyze the void finished(CrawlURI curi) method of the BdbFrontier class, which completes the end-of-processing (wrap-up) work for a CrawlURI object.

The method is defined in AbstractFrontier, a parent class of BdbFrontier:

org.archive.crawler.frontier.BdbFrontier

org.archive.crawler.frontier.AbstractFrontier

    /**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     * (non-Javadoc)
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    public void finished(CrawlURI curi) {
        try {
            KeyedProperties.loadOverridesFrom(curi);
            processFinish(curi);
        } finally {
            KeyedProperties.clearOverridesFrom(curi);
        }
    }
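The detail worth noting here is the try/finally pairing: loadOverridesFrom(curi) activates the URI's overlay (sheet) settings before processFinish() runs, and clearOverridesFrom(curi) must run even if processFinish() throws, so that the worker thread does not carry stale overrides into the next URI. Below is a minimal sketch of the same idiom; the surrounding class and the doWork() method are hypothetical, only the KeyedProperties calls (import assumed from org.archive.spring) mirror the code above.

    // Minimal sketch of the load-in-try / clear-in-finally idiom shown above.
    // OverlaySketch and doWork() are hypothetical, not Heritrix code.
    import org.archive.modules.CrawlURI;
    import org.archive.spring.KeyedProperties;

    public class OverlaySketch {
        public void handle(CrawlURI curi) {
            try {
                // activate this URI's overlay (sheet) settings before doing any
                // work that reads configurable properties
                KeyedProperties.loadOverridesFrom(curi);
                doWork(curi);
            } finally {
                // always undo the overrides, even if doWork() threw, so the
                // worker thread starts clean for the next URI
                KeyedProperties.clearOverridesFrom(curi);
            }
        }

        private void doWork(CrawlURI curi) {
            // hypothetical processing step
        }
    }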

finished() then calls the void processFinish(CrawlURI curi) method, which is defined in WorkQueueFrontier, another parent class of BdbFrontier:

org.archive.crawler.frontier.BdbFrontier

org.archive.crawler.frontier.WorkQueueFrontier

    /**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     * TODO: make as many decisions about what happens to the CrawlURI
     * (success, failure, retry) and queue (retire, snooze, ready) as 
     * possible elsewhere, such as in DispositionProcessor. Then, break
     * this into simple branches or focused methods for each case.
     *
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    protected void processFinish(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;

        long now = System.currentTimeMillis();

        curi.incrementFetchAttempts();
        logNonfatalErrors(curi);

        WorkQueue wq = (WorkQueue) curi.getHolder();
        // always refresh budgeting values from current curi
        // (whose overlay settings should be active here)
        wq.setSessionBudget(getBalanceReplenishAmount());
        wq.setTotalBudget(getQueueTotalBudget());

        assert (wq.peek(this) == curi) : "unexpected peek " + wq;
        int holderCost = curi.getHolderCost();

        if (needsReenqueuing(curi)) {
            // codes/errors which don't consume the URI, leaving it atop queue
            if (curi.getFetchStatus() != S_DEFERRED) {
                wq.expend(holderCost); // all retries but DEFERRED cost
            }
            long delay_ms = retryDelayFor(curi) * 1000;
            curi.processingCleanup(); // lose state that shouldn't burden retry
            wq.unpeek(curi);
            wq.update(this, curi); // rewrite any changes
            handleQueue(wq, curi.includesRetireDirective(), now, delay_ms);
            appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, DEFERRED_FOR_RETRY));
            doJournalReenqueued(curi);
            wq.makeDirty();
            return; // no further dequeueing, logging, rescheduling to occur
        }

        // Curi will definitely be disposed of without retry, so remove from queue
        wq.dequeue(this, curi);
        decrementQueuedCount(1);
        largestQueues.update(wq.getClassKey(), wq.getCount());

        log(curi);

        if (curi.isSuccess()) {
            // codes deemed 'success'
            incrementSucceededFetchCount();
            totalProcessedBytes.addAndGet(curi.getRecordedSize());
            appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, SUCCEEDED));
            doJournalFinishedSuccess(curi);
        } else if (isDisregarded(curi)) {
            // codes meaning 'undo' (even though URI was enqueued,
            // we now want to disregard it from normal success/failure tallies)
            // (eg robots-excluded, operator-changed-scope, etc)
            incrementDisregardedUriCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, DISREGARDED));
            holderCost = 0; // no charge for disregarded URIs
            // TODO: consider reinstating forget-URI capability, so URI could be
            // re-enqueued if discovered again
            doJournalDisregarded(curi);
        } else {
            // codes meaning 'failure'
            incrementFailedFetchCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, FAILED));
            // if exception, also send to crawlErrors
            if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
                Object[] array = { curi };
                loggerModule.getRuntimeErrors().log(Level.WARNING, curi.getUURI()
                        .toString(), array);
            }
            // charge queue any extra error penalty
            wq.noteError(getErrorPenaltyAmount());
            doJournalFinishedFailure(curi);
        }

        wq.expend(holderCost); // successes & failures charge cost to queue

        long delay_ms = curi.getPolitenessDelay();
        handleQueue(wq, curi.includesRetireDirective(), now, delay_ms);
        wq.makeDirty();

        if (curi.getRescheduleTime() > 0) {
            // marked up for forced-revisit at a set time
            curi.processingCleanup();
            curi.resetForRescheduling();
            futureUris.put(curi.getRescheduleTime(), curi);
            futureUriCount.incrementAndGet();
        } else {
            curi.stripToMinimal();
            curi.processingCleanup();
        }
    }
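Although processFinish() is long, it reduces to four dispositions for the finished CrawlURI: retry (left on its queue and re-attempted after a delay), success, disregarded, and failure. Only the non-retry cases dequeue the URI and charge its holder cost to the queue. The following self-contained sketch is illustrative only; in Heritrix the three flags are derived from fetch status codes via needsReenqueuing(), curi.isSuccess() and isDisregarded(), whereas here they are plain parameters.

    // Illustrative only, not Heritrix code: the four outcomes that
    // processFinish() distinguishes, reduced to a standalone classifier.
    public class DispositionSketch {
        enum Disposition { RETRY, SUCCESS, DISREGARD, FAILURE }

        static Disposition classify(boolean needsReenqueuing,
                                    boolean success,
                                    boolean disregarded) {
            if (needsReenqueuing) {
                return Disposition.RETRY;      // URI stays atop its queue, retried after a delay
            }
            if (success) {
                return Disposition.SUCCESS;    // dequeued, success counters and journal updated
            }
            if (disregarded) {
                return Disposition.DISREGARD;  // dequeued, excluded from success/failure tallies
            }
            return Disposition.FAILURE;        // dequeued, failure counters updated, error penalty charged
        }

        public static void main(String[] args) {
            System.out.println(classify(false, true, false)); // prints SUCCESS
        }
    }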

First, the holder attribute of the CrawlURI is obtained; this is the BdbWorkQueue object corresponding to the CrawlURI's classKey value (this touches on Heritrix 3.1.0 work-queue scheduling, which will be analyzed later; an illustrative sketch of the relationship follows the dequeue code below). The synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) method of that BdbWorkQueue object is then called.

org.archive.crawler.frontier.BdbWorkQueue

org.archive.crawler.frontier.WorkQueue

    /**
     * Remove the peekItem from the queue and adjusts the count.
     *
     * @param frontier  Work queues manager.
     */
    protected synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) {
        try {
            deleteItem(frontier, peekItem);
        } catch (IOException e) {
            //FIXME better exception handling
            e.printStackTrace();
            throw new RuntimeException(e);
        }
        unpeek(expected);
        count--;
        lastDequeueTime = System.currentTimeMillis();
    }
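As noted above, the queue that dequeue() operates on is recovered from the CrawlURI itself: queues are keyed by the URI's classKey (typically derived from the host), and each URI keeps a reference to the queue it was emitted from in its holder attribute. The following is a rough, hypothetical illustration of that relationship only; none of the class or field names below are Heritrix's.

    // Hypothetical illustration of the classKey / holder relationship,
    // not Heritrix code: one queue per classKey, and each URI remembers
    // the queue ("holder") it was emitted from.
    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    public class HolderSketch {
        static class Uri {
            final String classKey;   // e.g. derived from the URI's host
            Queue<Uri> holder;       // queue this URI was emitted from
            Uri(String classKey) { this.classKey = classKey; }
        }

        private final Map<String, Queue<Uri>> queuesByClassKey = new HashMap<>();

        // emitting a URI records its queue, the counterpart of the holder kept by CrawlURI
        void emit(Uri uri) {
            Queue<Uri> q = queuesByClassKey.computeIfAbsent(uri.classKey, k -> new ArrayDeque<>());
            q.add(uri);
            uri.holder = q;
        }

        // finishing a URI finds its queue again via the holder, the counterpart of
        // (WorkQueue) curi.getHolder() followed by wq.dequeue(...)
        void finished(Uri uri) {
            uri.holder.remove(uri);
        }
    }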

dequeue() delegates the actual removal to deleteItem(); the BdbWorkQueue implementation is:

org.archive.crawler.frontier.BdbWorkQueue

    protected void deleteItem(final WorkQueueFrontier frontier,
            final CrawlURI peekItem) throws IOException {
        try {
            final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                .getWorkQueues();
            queues.delete(peekItem);
        } catch (DatabaseException e) {
            throw new IOException(e);
        }
    }

Finally, the void delete(CrawlURI item) method of the BdbMultipleWorkQueues object is called. This method was covered in a previous article in this series and is not repeated here.
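For readers who do not have the earlier article at hand: BdbMultipleWorkQueues is backed by a Berkeley DB JE database, so the delete ultimately amounts to removing one record by key. The sketch below shows only that generic JE operation; how Heritrix actually builds the key from the CrawlURI is not reproduced here, and the class, method, and parameter names are assumptions.

    // Generic Berkeley DB JE delete-by-key sketch, not Heritrix's
    // BdbMultipleWorkQueues.delete(CrawlURI); names and key handling are assumed.
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.DatabaseException;
    import com.sleepycat.je.OperationStatus;

    public class DeleteByKeySketch {
        static void deleteRecord(Database pendingUrisDb, byte[] keyBytes)
                throws DatabaseException {
            DatabaseEntry key = new DatabaseEntry(keyBytes);
            // null transaction: rely on the environment's default commit behavior
            OperationStatus status = pendingUrisDb.delete(null, key);
            if (status != OperationStatus.SUCCESS) {
                // record already gone; real code might log or simply tolerate this
                System.err.println("delete returned " + status);
            }
        }
    }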

---------------------------------------------------------------------------

This Heritrix 3.1.0 source code parsing series is the author's original work.

If you reprint it, please credit the source: the cnblogs (Blog Garden) author "hedgehog gentle".

Original article link: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025419.html
