The API reference documentation can be found at http://crawler.archive.org/apidocs/.
The Extractors built into Heritrix often cannot do exactly what is needed. This is not because they are not powerful enough, but because a crawl frequently has specific requirements when parsing a page: you may only want to collect links of a certain format, or text fragments with a particular structure. The general-purpose Extractors that ship with Heritrix capture everything, so Heritrix has no way to control which content is collected, which leads to a bloated collection of crawled content and poor indexing. The following example illustrates how to write and use a custom Extractor. The example itself is very simple: its job is to capture all of the news links on the Sohu news homepage, whose URLs have the following format.
http://news.sohu.com/20061122/n246553333.shtml
(1) Analyzing this URL, the host part is http://news.sohu.com, and the final path segment is a news serial number: a string of digits prefixed with the letter "n".
(2) Based on this analysis, you can write a regular expression that captures the characteristics of such URLs. When a link matches the regular expression, it is considered a link potentially worth capturing and is collected for crawling. The regular expression is as follows:
http://news\.sohu\.com/\d+/n\d+\.shtml
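As a quick sanity check, a small standalone snippet like the following (not part of the original example; the class name is only a placeholder) can be used to confirm that the sample news URL matches this pattern:

import java.util.regex.Pattern;

public class SohuUrlPatternTest {
    public static void main(String[] args) {
        // The pattern described above, written as a Java string literal
        String pattern = "http://news\\.sohu\\.com/\\d+/n\\d+\\.shtml";
        String sample = "http://news.sohu.com/20061122/n246553333.shtml";
        // Prints "true" if the sample news URL matches the pattern
        System.out.println(Pattern.matches(pattern, sample));
    }
}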
(3) All Extractors inherit from the abstract base class org.archive.crawler.extractor.Extractor, which implements the innerProcess method. The following is the implementation of innerProcess:
public void innerProcess(CrawlURI curi) {
    try {
        /*
         * Process links
         */
        extract(curi);
    } catch (NullPointerException npe) {
        curi.addAnnotation("err=" + npe.getClass().getName());
        curi.addLocalizedError(getName(), npe, "");
        logger.log(Level.WARNING, getName() + ": NullPointerException", npe);
    } catch (StackOverflowError soe) {
        curi.addAnnotation("err=" + soe.getClass().getName());
        curi.addLocalizedError(getName(), soe, "");
        logger.log(Level.WARNING, getName() + ": StackOverflowError", soe);
    } catch (java.nio.charset.CoderMalfunctionError cme) {
        curi.addAnnotation("err=" + cme.getClass().getName());
        curi.addLocalizedError(getName(), cme, "");
        logger.log(Level.WARNING, getName() + ": CoderMalfunctionError", cme);
    }
}
In this method, most of the code handles the various exceptions and log writes that can occur during parsing. More importantly, it calls extract(CrawlURI), the method that the base class leaves for all concrete Extractors to provide; in other words, a class that inherits from this base class only needs to implement the extract method. Extending an Extractor therefore involves the following steps:
(1) Write a class that inherits from the Extractor base class.
(2) In the constructor, call the parent class constructor so that the object is fully initialized.
(3) Override the extract(CrawlURI) method, as in the minimal skeleton below.
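A minimal skeleton that follows these three steps might look like the following; the class name MyExtractor is only a placeholder, and the complete, working example for Sohu news is given in full afterwards.

package my;

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;

public class MyExtractor extends Extractor {
    // (2) Call the parent class constructor
    public MyExtractor(String name, String description) {
        super(name, description);
    }
    // (3) Override extract(CrawlURI), which is invoked by innerProcess
    protected void extract(CrawlURI curi) {
        // Custom link-extraction logic goes here
    }
}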
To capture all the news links on the news.sohu.com homepage, the full source code of the custom Extractor is as follows.
package my;

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;
import org.archive.crawler.extractor.Link;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

public class SohuNewsExtractor extends Extractor {
    private static Logger logger = Logger.getLogger(SohuNewsExtractor.class
            .getName());
    // Constructor
    public SohuNewsExtractor(String name) {
        this(name, "Sohu News Extractor");
    }
    // Constructor
    public SohuNewsExtractor(String name, String description) {
        super(name, description);
    }
    // The first regular expression, used to match the Sohu news URL format
    public static final String PATTERN_SOHU_NEWS =
            "http://news\\.sohu\\.com/\\d+/n\\d+\\.shtml";
    // The second regular expression, used to match all <a href="xxx"> tags
    public static final String PATTERN_A_HREF =
            "<a\\s+href\\s*=\\s*(\"([^\"]*)\"|[^\\s>])\\s*>";
    // Overridden method
    protected void extract(CrawlURI curi) {
        // Convert the link object into a string
        String url = curi.toString();
        /*
         * The following code obtains the response body of the current link
         * so that it can be analyzed.
         */
        ReplayCharSequence cs = null;
        try {
            HttpRecorder hr = curi.getHttpRecorder();
            if (hr == null) {
                throw new IOException("Why is recorder null here?");
            }
            cs = hr.getReplayCharSequence();
        } catch (IOException e) {
            curi.addLocalizedError(this.getName(), e,
                    "Failed get of replay char sequence " + curi.toString()
                            + " " + e.getMessage());
            logger.log(Level.SEVERE, "Failed get of replay char sequence in "
                    + Thread.currentThread().getName(), e);
        }
        // If nothing was captured, return
        if (cs == null) {
            return;
        }
        // Convert the content returned by the link to a string
        String content = cs.toString();
        try {
            // Run the regular expression over the page content
            // to retrieve the link information
            Pattern pattern = Pattern.compile(PATTERN_A_HREF,
                    Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(content);
            // While links are found
            while (matcher.find()) {
                String newUrl = matcher.group(2);
                // Check whether the link is in the Sohu news format
                if (newUrl.matches(PATTERN_SOHU_NEWS)) {
                    // If so, add the link to the queue
                    // for later processing
                    addLinkFromString(curi, newUrl, "", Link.NAVLINK_HOP);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    // Save the link for later processing
    private void addLinkFromString(CrawlURI curi, String uri,
            CharSequence context, char hopType) {
        try {
            curi.createAndAddLinkRelativeToBase(uri, context.toString(),
                    hopType);
        } catch (URIException e) {
            if (getController() != null) {
                getController().logUriError(e, curi.getUURI(), uri);
            } else {
                logger.info("failed createAndAddLinkRelativeToBase "
                        + curi + ", " + uri + ", " + context + ", "
                        + hopType + ": " + e);
            }
        }
    }
}
Note: if it is not obvious where extract(CrawlURI) fits in, recall that it is the method called from innerProcess in the Extractor base class shown above; overriding it is what plugs the custom logic into the processing chain. Let's take a closer look at the code.
(1) The first step is to obtain the HTML response body of the link that the Fetcher downloaded and convert it into a string, so that the links on the page can be processed.
(2) Then retrieve all the links from the page content using the regular expression and check whether each one conforms to the Sohu news format. If it does, call the addLinkFromString() method to add the link to a queue for later processing. A standalone sketch of this matching logic follows.
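The two-step matching can be tried outside Heritrix with a small standalone sketch such as the following; the sample HTML string and the class name are invented purely for illustration.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SohuLinkMatchDemo {
    public static void main(String[] args) {
        String patternSohuNews = "http://news\\.sohu\\.com/\\d+/n\\d+\\.shtml";
        String patternAHref = "<a\\s+href\\s*=\\s*(\"([^\"]*)\"|[^\\s>])\\s*>";
        // A made-up page fragment containing one news link and one other link
        String content = "<a href=\"http://news.sohu.com/20061122/n246553333.shtml\">news</a>"
                + "<a href=\"http://www.sohu.com/index.html\">home</a>";
        Matcher matcher = Pattern.compile(patternAHref,
                Pattern.CASE_INSENSITIVE).matcher(content);
        while (matcher.find()) {
            String newUrl = matcher.group(2);
            // Only print the links that look like Sohu news pages
            if (newUrl != null && newUrl.matches(patternSohuNews)) {
                System.out.println(newUrl);
            }
        }
    }
}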
After the Extractor class is developed, to start Heritrix from the WebUI and have the new class appear in the drop-down list, you need to modify the Processor.options file under the modules directory of the Eclipse project.
Opening the Processor.options file shows that it holds all the entries displayed in the drop-down list on the processor chain settings page of the WebUI. To add the SohuNewsExtractor developed here, you only need to add one row at the appropriate position, with the following content:
my.SohuNewsExtractor|SohuNewsExtractor
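After the edit, the relevant portion of Processor.options contains lines of the form full-class-name|display-name. Only the last line below is the newly added one; the surrounding entries are shown merely to illustrate the existing format and may differ slightly in an actual installation.

org.archive.crawler.extractor.ExtractorHTTP|ExtractorHTTP
org.archive.crawler.extractor.ExtractorHTML|ExtractorHTML
my.SohuNewsExtractor|SohuNewsExtractor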
Next, start Heritrix again, create a new task, and go to the processor chain settings page; the self-developed Extractor now appears in the drop-down list.
Click "Add" to add it to the processing queue.
Note that it must be placed after ExtractorHTTP, so that Heritrix first processes the HTTP-protocol-related content. Following the same process used to add a custom Extractor, developers can also customize the other kinds of processors; again, you only need to find the corresponding .options file in the modules directory and add the fully qualified class name of the new processor.
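For example, after adding the custom Extractor, the extractor portion of the processor chain might be ordered as follows; the exact set of default extractors depends on the chosen profile, so this is only an illustration of the required ordering.

ExtractorHTTP
ExtractorHTML
SohuNewsExtractor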