Heritrix multithreading: ToePool and ToeThread

Source: Internet
Author: User
To fetch web content quickly and efficiently, a crawler has to be multithreaded. Heritrix provides a standard thread pool, ToePool, which manages all of the crawling threads. Both ToePool and ToeThread live in the org.archive.crawler.framework package. As mentioned above, ToePool is initialized in the initialize() method of CrawlController. Let's take a look at how ToePool and ToeThread are initialized. The following code in CrawlController sets up the ToePool:

    // Construct the pool
    toePool = new ToePool(this);
    // Instantiate and start the threads according to the configuration in order.xml
    toePool.setSize(order.getMaxToes());

The ToePool constructor is very simple:

    public ToePool(CrawlController c) {
        super("ToeThreads");
        this.controller = c;
    }

It only calls the constructor of its parent class, java.lang.ThreadGroup, and stores the injected CrawlController in a field. That is all it takes to create an instance of the thread pool. But how are the actual worker threads created?
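To make the ThreadGroup point concrete, here is a minimal sketch that uses only the standard library. It is not Heritrix code; the class name ThreadGroupDemo and the method countsBeforeAndAfterStart are invented for illustration. It shows that a freshly constructed group holds no threads, and that a thread joins the group only when it is constructed with that group and started.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class, not part of Heritrix.
class ThreadGroupDemo {
    /** Returns {threads in the group before any start, threads found after one start}. */
    static int[] countsBeforeAndAfterStart() {
        ThreadGroup group = new ThreadGroup("DemoThreads");
        int before = group.activeCount(); // nothing has been started yet

        // The worker blocks on a latch so that it is certainly alive when we count.
        CountDownLatch release = new CountDownLatch(1);
        Thread worker = new Thread(group, () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "demo-worker");
        worker.start();

        // enumerate() copies the group's live threads into the array
        // and returns how many it stored.
        Thread[] found = new Thread[group.activeCount() + 10];
        int after = group.enumerate(found);

        release.countDown(); // let the worker finish
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = countsBeforeAndAfterStart();
        System.out.println("before start: " + counts[0] + ", after start: " + counts[1]);
    }
}
```

Running main prints "before start: 0, after start: 1".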
Next, look at the pool's setSize(int) method. Judging by its name it is an ordinary setter, but in fact it does more than that:

    public void setSize(int newsize) {
        targetSize = newsize;
        int difference = newsize - getToeCount();

        if (difference > 0) {
            // Fewer threads exist than requested: start the missing ones
            for (int i = 1; i <= difference; i++) {
                startNewThread();
            }
        } else {
            // The pool already holds at least the requested number of threads:
            // keep the first targetSize toes and retire the rest
            int retainedToes = targetSize;
            // Collect the pool's threads into an array
            Thread[] toes = this.getToes();

            for (int i = 0; i < toes.length; i++) {
                // Skip unused array slots and threads that are not toes
                if (!(toes[i] instanceof ToeThread)) {
                    continue;
                }
                retainedToes--;
                if (retainedToes >= 0) {
                    continue; // this toe is kept
                }
                // Surplus toe: ask it to retire
                ToeThread tt = (ToeThread) toes[i];
                tt.retire();
            }
        }
    }

    // Obtain all threads that belong to this pool
    private Thread[] getToes() {
        Thread[] toes = new Thread[activeCount() + 10];
        // ToePool extends java.lang.ThreadGroup, so enumerate(Thread[])
        // copies the group's active threads into the toes array
        // for later management
        this.enumerate(toes);
        return toes;
    }

    // Start a new thread
    private synchronized void startNewThread() {
        ToeThread newThread = new ToeThread(this, nextSerialNumber++);
        newThread.setPriority(DEFAULT_TOE_PRIORITY);
        newThread.start();
    }

The code above shows that the thread pool contains no active threads when it is created; threads are started only when setSize is called. If setSize is called several times with different arguments, the pool starts or retires threads each time so that their number matches the requested size.
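The grow-and-shrink behaviour can be tried out in isolation. The following is a simplified sketch that follows the same pattern, not the Heritrix implementation: the names ResizablePoolSketch, Pool, Worker, workerCount and demo are invented here, the worker only sleeps instead of crawling, and setSize waits for retired workers to exit (an assumption made purely so that the counts are deterministic).

```java
// Hypothetical sketch of a resizable ThreadGroup-based pool; not Heritrix code.
class ResizablePoolSketch {
    /** Idle worker that runs until it is asked to retire. */
    static class Worker extends Thread {
        private volatile boolean shouldRetire = false;

        Worker(ThreadGroup group, int serial) {
            super(group, "worker-" + serial);
        }

        void retire() {
            shouldRetire = true;
            interrupt(); // wake the worker so it notices the flag promptly
        }

        @Override
        public void run() {
            while (!shouldRetire) {
                try {
                    Thread.sleep(50); // placeholder for real work
                } catch (InterruptedException e) {
                    // fall through and re-check shouldRetire
                }
            }
        }
    }

    static class Pool extends ThreadGroup {
        private int nextSerialNumber = 1;

        Pool() {
            super("workers");
        }

        /** Copies the group's live threads into an array (unused slots stay null). */
        private Thread[] threads() {
            Thread[] all = new Thread[activeCount() + 10];
            enumerate(all);
            return all;
        }

        int workerCount() {
            int count = 0;
            for (Thread t : threads()) {
                if (t instanceof Worker) {
                    count++;
                }
            }
            return count;
        }

        synchronized void setSize(int newSize) {
            int difference = newSize - workerCount();
            if (difference > 0) {
                for (int i = 0; i < difference; i++) {
                    new Worker(this, nextSerialNumber++).start();
                }
                return;
            }
            int retained = newSize;
            for (Thread t : threads()) {
                if (!(t instanceof Worker)) {
                    continue; // null slot or foreign thread
                }
                if (--retained >= 0) {
                    continue; // this worker is kept
                }
                Worker surplus = (Worker) t;
                surplus.retire();
                try {
                    surplus.join(); // demo-only: wait so the next count is exact
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

    /** Returns the worker counts at creation, after growing to 3, after shrinking to 1. */
    static int[] demo() {
        Pool pool = new Pool();
        int atCreation = pool.workerCount();
        pool.setSize(3);
        int afterGrow = pool.workerCount();
        pool.setSize(1);
        int afterShrink = pool.workerCount();
        pool.setSize(0); // retire the last worker so the JVM can exit
        return new int[] {atCreation, afterGrow, afterShrink};
    }

    public static void main(String[] args) {
        int[] c = demo();
        System.out.println(c[0] + " -> " + c[1] + " -> " + c[2]);
    }
}
```

Running main prints "0 -> 3 -> 1": no workers exist at creation, and each setSize call moves the pool to the requested size.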

 

Once a thread has been started, its run() method executes. Next, let's take a look at how ToeThread handles the links it obtains from the Frontier:

    public void run() {
        String name = controller.getOrder().getCrawlOrderName();
        logger.fine(getName() + " started for order '" + name + "'");

        try {
            while (true) {
                // Check whether processing should continue
                continueCheck();
                setStep(STEP_ABOUT_TO_GET_URI);
                // Ask the Frontier for the next link to process
                CrawlURI curi = controller.getFrontier().next();
                // Record the current link while holding this thread's lock
                synchronized (this) {
                    continueCheck();
                    setCurrentCuri(curi);
                }

                // Process the retrieved link
                processCrawlUri();
                setStep(STEP_ABOUT_TO_RETURN_URI);
                // Check whether processing should continue
                continueCheck();
                // Hand the link back through the Frontier's finished() method,
                // which wraps up its processing, for example by adding
                // newly discovered links to the waiting queue
                synchronized (this) {
                    controller.getFrontier().finished(currentCuri);
                    setCurrentCuri(null);
                }

                // Subsequent processing
                setStep(STEP_FINISHING_PROCESS);
                lastFinishTime = System.currentTimeMillis();
                // Release the continue permission
                controller.releaseContinuePermission();
                if (shouldRetire) {
                    break; // from while (true)
                }
            }
        } catch (EndedException e) {
            // Nothing to do: the thread is simply allowed to end
        } catch (Exception e) {
            logger.log(Level.SEVERE, "Fatal exception in " + getName(), e);
        } catch (OutOfMemoryError err) {
            seriousError(err);
        } finally {
            controller.releaseContinuePermission();
        }
        setCurrentCuri(null);

        // Clear cached data and drop references
        this.httpRecorder.closeRecorders();
        this.httpRecorder = null;
        localProcessors = null;

        logger.fine(getName() + " finished for order '" + name + "'");
        setStep(STEP_FINISHED);
        controller.toeEnded();
        controller = null;
    }

The method above shows clearly how a worker thread proceeds step by step: it gets the next link from the Frontier, processes it, calls the Frontier's finished() method to wrap up, and releases the continue permission. Once the loop has ended, it clears its cached data and terminates. The remaining statements are logging and step tracking, which record the thread's state at each stage. The most important statement in the code above is clearly the call to processCrawlUri(), which invokes the processor chain that actually processes the link.
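The shape of that loop (take a link, process it, report it as finished, check the retire flag) can be reproduced with a few lines of standard Java. This is a reduced sketch, not Heritrix code: SimpleFrontier, Worker, crawl and the local EndedException are hypothetical stand-ins, links are plain strings, and the stub frontier is assumed to signal the end of the crawl by throwing EndedException once its queue is empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of the worker loop; not Heritrix code.
class WorkerLoopSketch {
    /** Signals that no work is left (named after the exception caught in run()). */
    static class EndedException extends Exception {
    }

    /** Heavily reduced stand-in for the Frontier. */
    static class SimpleFrontier {
        private final Queue<String> pending = new ConcurrentLinkedQueue<>();
        private final List<String> finished = new CopyOnWriteArrayList<>();

        SimpleFrontier(List<String> seeds) {
            pending.addAll(seeds);
        }

        String next() throws EndedException {
            String uri = pending.poll();
            if (uri == null) {
                throw new EndedException(); // assumption: an empty queue ends the crawl
            }
            return uri;
        }

        void finished(String uri) {
            finished.add(uri);
        }

        List<String> finishedUris() {
            return new ArrayList<>(finished);
        }
    }

    static class Worker extends Thread {
        private final SimpleFrontier frontier;
        private volatile boolean shouldRetire = false;
        private String currentUri;

        Worker(SimpleFrontier frontier) {
            super("worker");
            this.frontier = frontier;
        }

        void retire() {
            shouldRetire = true; // takes effect after the current link is finished
        }

        @Override
        public void run() {
            try {
                while (true) {
                    String uri = frontier.next();
                    synchronized (this) {
                        currentUri = uri;
                    }
                    process(uri);
                    synchronized (this) {
                        frontier.finished(currentUri);
                        currentUri = null;
                    }
                    if (shouldRetire) {
                        break; // from while (true)
                    }
                }
            } catch (EndedException e) {
                // no work left: let the thread end
            }
        }

        private void process(String uri) {
            // placeholder for the processor chain
        }
    }

    /** Runs one worker over the seeds and returns the links it reported as finished. */
    static List<String> crawl(List<String> seeds) {
        SimpleFrontier frontier = new SimpleFrontier(seeds);
        Worker worker = new Worker(frontier);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return frontier.finishedUris();
    }

    public static void main(String[] args) {
        System.out.println(crawl(Arrays.asList("http://a.example/", "http://b.example/")));
    }
}
```

With a single worker the links are finished in the order they were queued, so main prints "[http://a.example/, http://b.example/]". Note that the retire flag is checked only at the end of an iteration, so a retiring worker always finishes the link it is holding first.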
