Multithreading is essential for capturing webpage content quickly and efficiently. Heritrix provides a standard thread pool, ToePool, which manages all crawling threads. Both ToePool and ToeThread are located in the org.archive.crawler.framework package. As mentioned above, ToePool is initialized in the initialize() method of CrawlController. Let's take a look at how ToePool and ToeThread are initialized. The following code initializes the ToePool in CrawlController.
// Construct the thread pool
toePool = new ToePool(this);
// Create and start the threads according to the configuration in order.xml
toePool.setSize(order.getMaxToes());
The ToePool constructor is very simple, as shown below:
public ToePool(CrawlController c) {
    super("ToeThreads");
    this.controller = c;
}
It only calls the constructor of the parent class java.lang.ThreadGroup and assigns the injected CrawlController to an instance field. In this way, an instance of the thread pool is created. But how are the real working threads created?
The following describes the setSize(int) method of the thread pool. Judging by its name, it looks like an ordinary setter, but it does more than assign a value.
public void setSize(int newsize)
{
    targetSize = newsize;
    int difference = newsize - getToeCount();
    // If the pool currently holds fewer threads than the target,
    // start new threads to make up the difference
    if (difference > 0) {
        for (int i = 1; i <= difference; i++) {
            startNewThread();
        }
    }
    // Otherwise the pool already holds at least the target number of threads,
    // so retire the surplus
    else
    {
        int retainedToes = targetSize;
        // Collect the threads in the pool into an array
        Thread[] toes = this.getToes();
        // Loop over them and retire the unneeded threads
        for (int i = 0; i < toes.length; i++) {
            if (!(toes[i] instanceof ToeThread)) {
                continue;
            }
            retainedToes--;
            if (retainedToes >= 0) {
                continue; // this thread is kept
            }
            ToeThread tt = (ToeThread) toes[i];
            tt.retire();
        }
    }
}
// Obtain all threads that belong to the current thread pool
private Thread[] getToes()
{
    Thread[] toes = new Thread[activeCount() + 10];
    // Because ToePool inherits from java.lang.ThreadGroup,
    // calling enumerate(Thread[] toes) copies every active thread
    // in the ThreadGroup into the toes array
    // for subsequent management
    this.enumerate(toes);
    return toes;
}
// Start a new thread
private synchronized void startNewThread()
{
    ToeThread newThread = new ToeThread(this, nextSerialNumber++);
    newThread.setPriority(DEFAULT_TOE_PRIORITY);
    newThread.start();
}
From the code above we can conclude that the thread pool holds no active thread instances when it is created. New threads are created only when its setSize method is called. If setSize is called several times with different arguments, the pool increases or decreases the number of threads it manages to match the value passed in.
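The grow-and-shrink behavior described above can be illustrated with a minimal, self-contained sketch. It is not Heritrix code: SketchPool and SketchWorker are hypothetical stand-ins for ToePool and ToeThread, the worker only sleeps instead of crawling, and, unlike the original, setSize here waits for retired workers to exit so that the printed counts are stable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical worker standing in for ToeThread: idles until asked to retire. */
class SketchWorker extends Thread {
    private volatile boolean shouldRetire = false;

    SketchWorker(ThreadGroup group, int serialNumber) {
        super(group, "SketchWorker #" + serialNumber);
        setDaemon(true); // do not keep the JVM alive for this demo
    }

    /** Ask the worker to stop after its current unit of work. */
    void retire() {
        shouldRetire = true;
    }

    @Override
    public void run() {
        while (!shouldRetire) {
            try {
                Thread.sleep(5); // placeholder for one unit of real work
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}

/** Hypothetical pool standing in for ToePool; it is itself a ThreadGroup. */
class SketchPool extends ThreadGroup {
    private int nextSerialNumber = 1;

    SketchPool() {
        super("SketchWorkers"); // constructing the group starts no threads
    }

    /** Snapshot of the live workers in this group, found via enumerate(). */
    synchronized List<SketchWorker> workers() {
        Thread[] slots = new Thread[activeCount() + 10]; // headroom, as in getToes()
        int found = enumerate(slots);
        List<SketchWorker> live = new ArrayList<>();
        for (int i = 0; i < found; i++) {
            if (slots[i] instanceof SketchWorker && slots[i].isAlive()) {
                live.add((SketchWorker) slots[i]);
            }
        }
        return live;
    }

    /** Start missing workers, or retire surplus ones, to reach newSize. */
    synchronized void setSize(int newSize) {
        int target = Math.max(0, newSize);
        List<SketchWorker> current = workers();
        int difference = target - current.size();
        if (difference > 0) {
            for (int i = 0; i < difference; i++) {
                new SketchWorker(this, nextSerialNumber++).start();
            }
            return;
        }
        List<SketchWorker> extras = current.subList(target, current.size());
        for (SketchWorker w : extras) {
            w.retire();
        }
        // Assumption of this sketch only: block until the retired workers exit.
        for (SketchWorker w : extras) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}

class ResizablePoolSketch {
    public static void main(String[] args) {
        SketchPool pool = new SketchPool();
        System.out.println("after construction: " + pool.workers().size());
        pool.setSize(5);
        System.out.println("after setSize(5): " + pool.workers().size());
        pool.setSize(2);
        System.out.println("after setSize(2): " + pool.workers().size());
        pool.setSize(0);
    }
}
```

Retirement is cooperative in both the sketch and the code above: retire() only sets a flag, and the worker leaves its loop the next time it checks that flag.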
After a thread is started, the code in its run() method is executed. Next, let's take a look at how ToeThread handles the links obtained from the Frontier.
public void run()
{
    String name = controller.getOrder().getCrawlOrderName();
    logger.fine(getName() + " started for order '" + name + "'");
    try {
        while (true)
        {
            // Check whether processing should continue
            continueCheck();
            setStep(STEP_ABOUT_TO_GET_URI);
            // Use the Frontier's next() method
            // to retrieve the next link to be processed
            CrawlURI curi = controller.getFrontier().next();
            // Synchronize on the current thread
            synchronized (this) {
                continueCheck();
                setCurrentCuri(curi);
            }
            /*
             * Process the retrieved link
             */
            processCrawlUri();
            setStep(STEP_ABOUT_TO_RETURN_URI);
            // Check whether processing should continue
            continueCheck();
            // Use the Frontier's finished() method
            // to wrap up the link just processed,
            // for example by adding the links discovered during analysis
            // to the waiting queue
            synchronized (this) {
                controller.getFrontier().finished(currentCuri);
                setCurrentCuri(null);
            }
            // Subsequent processing
            setStep(STEP_FINISHING_PROCESS);
            lastFinishTime = System.currentTimeMillis();
            // Release the permission to continue
            controller.releaseContinuePermission();
            if (shouldRetire) {
                break; // from while (true)
            }
        }
    } catch (EndedException e) {
    } catch (Exception e) {
        logger.log(Level.SEVERE, "Fatal exception in " + getName(), e);
    } catch (OutOfMemoryError err) {
        seriousError(err);
    } finally {
        controller.releaseContinuePermission();
    }
    setCurrentCuri(null);
    // Clean up cached data
    this.httpRecorder.closeRecorders();
    this.httpRecorder = null;
    localProcessors = null;
    logger.fine(getName() + " finished for order '" + name + "'");
    setStep(STEP_FINISHED);
    controller.toeEnded();
    controller = null;
}
The method above clearly shows how a worker thread gets the next link to be processed from the Frontier, processes it, calls the Frontier's finished method to wrap up, releases its permission to continue, cleans up its cached data, and finally terminates. There are also some logging operations, mainly to record the status of each capture. Clearly, the most important statement in the code above is processCrawlUri(), which actually invokes the processor chain to handle the link.
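The fetch, process, and finish cycle of the worker loop can be reduced to a small runnable sketch. Everything in it is an assumption made for illustration: SketchFrontier is a hypothetical queue-backed stand-in for the Frontier, links are plain strings rather than CrawlURI objects, process() is an empty placeholder for the processor chain, and a worker simply exits when no link arrives within a short timeout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the Frontier: hands out links, records finished ones. */
class SketchFrontier {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final List<String> done = new CopyOnWriteArrayList<>();

    void schedule(String uri) {
        pending.add(uri);
    }

    /** Next link to process, or null if nothing arrives within 100 ms. */
    String next() throws InterruptedException {
        return pending.poll(100, TimeUnit.MILLISECONDS);
    }

    void finished(String uri) {
        done.add(uri);
    }

    List<String> finishedUris() {
        return new ArrayList<>(done);
    }
}

/** Hypothetical stand-in for ToeThread, keeping only the core loop. */
class SketchToeThread extends Thread {
    private final SketchFrontier frontier;
    private volatile boolean shouldRetire = false;

    SketchToeThread(SketchFrontier frontier, int serialNumber) {
        super("SketchToeThread #" + serialNumber);
        this.frontier = frontier;
    }

    void retire() {
        shouldRetire = true;
    }

    private void process(String uri) {
        // Placeholder for the processor chain (fetch, extract, write, ...).
    }

    @Override
    public void run() {
        try {
            while (!shouldRetire) {
                String uri = frontier.next(); // 1. get the next link
                if (uri == null) {
                    break; // simplification: an empty queue ends the worker
                }
                process(uri);                 // 2. process it
                frontier.finished(uri);       // 3. report it as finished
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class WorkerLoopSketch {
    /** Schedule the seeds, run threadCount workers to completion, return finished links. */
    static List<String> crawl(List<String> seeds, int threadCount) {
        SketchFrontier frontier = new SketchFrontier();
        for (String seed : seeds) {
            frontier.schedule(seed);
        }
        List<SketchToeThread> threads = new ArrayList<>();
        for (int i = 1; i <= threadCount; i++) {
            SketchToeThread t = new SketchToeThread(frontier, i);
            threads.add(t);
            t.start();
        }
        for (SketchToeThread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return frontier.finishedUris();
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList(
                "http://a.example/", "http://b.example/", "http://c.example/");
        System.out.println(crawl(seeds, 2));
    }
}
```

With more than one worker the order of the finished links is not deterministic, because the workers pull from the shared queue concurrently.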