Adding an Extractor to extend Heritrix


Heritrix API documentation: http://crawler.archive.org/apidocs/

The built-in Extractors of Heritrix often cannot do exactly the work you need. This is not because they are not powerful enough, but because a crawl usually has specific requirements when parsing a page: for example, you may only want to capture links in a certain format, or text fragments with a particular structure. The general-purpose Extractors supplied with Heritrix simply capture everything, so Heritrix has no way to control which content should or should not be collected, which leads to a cluttered local copy and poor indexing. The following example illustrates how to write and use a custom Extractor. The example itself is very simple: its job is to capture all the news links on the Sohu news homepage, whose URLs have the following format.

http://news.sohu.com/20061122/n246553333.shtml

(1) Analyzing this URL shows that the host part is http://news.sohu.com, followed by a date-like serial number, and that the news number always starts with "n" followed by digits.

(2) Based on this analysis, a regular expression can be written that captures the characteristics of these URLs. When a link matches the regular expression, it is considered a link that is potentially worth capturing and is collected for later crawling. The regular expression is as follows:

http://news\.sohu\.com/[\d]+/n[\d]+\.shtml
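
As a quick sanity check, the pattern can be tried against the sample URL above using java.util.regex. This is only a standalone sketch (the class name SohuPatternTest is made up); it is not part of the extractor itself:

import java.util.regex.Pattern;

public class SohuPatternTest {
    // The same pattern, written as a Java string literal (backslashes doubled)
    private static final String PATTERN_SOHU_NEWS =
            "http://news\\.sohu\\.com/[\\d]+/n[\\d]+\\.shtml";

    public static void main(String[] args) {
        // The sample news URL should match
        String sample = "http://news.sohu.com/20061122/n246553333.shtml";
        System.out.println(sample.matches(PATTERN_SOHU_NEWS));          // true

        // A URL without the news number should not match
        System.out.println(Pattern.matches(PATTERN_SOHU_NEWS,
                "http://news.sohu.com/index.shtml"));                   // false
    }
}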
(3) All Extractors inherit from the abstract base class org.archive.crawler.extractor.Extractor, which implements the innerProcess method. The following is the implementation of innerProcess:

public void innerProcess(CrawlURI curi) {
    try {
        /*
         * Process links
         */
        extract(curi);
    } catch (NullPointerException npe) {
        curi.addAnnotation("err=" + npe.getClass().getName());
        curi.addLocalizedError(getName(), npe, "");
        logger.log(Level.WARNING, getName() + ": NullPointerException", npe);
    } catch (StackOverflowError soe) {
        curi.addAnnotation("err=" + soe.getClass().getName());
        curi.addLocalizedError(getName(), soe, "");
        logger.log(Level.WARNING, getName() + ": StackOverflowError", soe);
    } catch (java.nio.charset.CoderMalfunctionError cme) {
        curi.addAnnotation("err=" + cme.getClass().getName());
        curi.addLocalizedError(getName(), cme, "");
        logger.log(Level.WARNING, getName() + ": CoderMalfunctionError", cme);
    }
}

In this method, most of the code handles the various exceptions that may occur during parsing and writes them to the log. More importantly, it delegates the actual work to extract(CrawlURI), a method that every Extractor has to provide. In other words, once a class inherits from this base class, it only needs to implement the extract method. Extending the Extractor therefore involves the following steps:

(1) Write a class that extends the Extractor base class.
(2) In its constructors, call the constructor of the parent class so that the processor object is initialized completely.
(3) Override the extract(CrawlURI) method (a minimal skeleton is shown right after this list).
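
The skeleton below sketches these three steps using the same base class as the full example further down; the class name MyExtractor is just a placeholder:

package my;

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;

// (1) A class that extends the Extractor base class
public class MyExtractor extends Extractor {

    // (2) Call the parent constructor so the processor is initialized completely
    public MyExtractor(String name) {
        super(name, "My custom Extractor");
    }

    // (3) Override extract(CrawlURI); innerProcess of the base class invokes it
    protected void extract(CrawlURI curi) {
        // Custom link-extraction logic goes here
    }
}
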
To capture all the news links on the news.sohu.com homepage, the full source code of the developed Extractor is as follows.

package my;

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;
import org.archive.crawler.extractor.Link;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

public class SohuNewsExtractor extends Extractor {

    private static Logger logger = Logger.getLogger(SohuNewsExtractor.class
            .getName());

    // Constructor
    public SohuNewsExtractor(String name) {
        this(name, "Sohu News Extractor");
    }

    // Constructor
    public SohuNewsExtractor(String name, String description) {
        super(name, description);
    }

    // The first regular expression, used to match the Sohu news URL format
    public static final String PATTERN_SOHU_NEWS =
            "http://news\\.sohu\\.com/[\\d]+/n[\\d]+\\.shtml";

    // The second regular expression, used to match all <a href="xxx"> tags
    public static final String PATTERN_A_HREF =
            "<a\\s+href\\s*=\\s*(\"([^\"]*)\"|[^\\s>])\\s*>";

    // Overridden method
    protected void extract(CrawlURI curi) {

        // Convert the link object into a string
        String url = curi.toString();

        /*
         * The following code obtains the response body of the current link
         * so that it can be analyzed.
         */
        ReplayCharSequence cs = null;
        try {
            HttpRecorder hr = curi.getHttpRecorder();
            if (hr == null) {
                throw new IOException("Why is recorder null here?");
            }
            cs = hr.getReplayCharSequence();
        } catch (IOException e) {
            curi.addLocalizedError(this.getName(), e,
                    "Failed get of replay char sequence " + curi.toString()
                            + " " + e.getMessage());
            logger.log(Level.SEVERE, "Failed get of replay char sequence in "
                    + Thread.currentThread().getName(), e);
        }

        // If nothing was captured, return
        if (cs == null) {
            return;
        }

        // Convert the content returned by the link to a string
        String content = cs.toString();
        try {
            // Apply the regular expression to the page content
            // to retrieve the link information
            Pattern pattern = Pattern.compile(PATTERN_A_HREF,
                    Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(content);

            // For every link that is found
            while (matcher.find()) {
                String newUrl = matcher.group(2);
                // Check whether it is in the Sohu news format
                if (newUrl != null && newUrl.matches(PATTERN_SOHU_NEWS)) {
                    // If so, add the link to the queue
                    // for later processing
                    addLinkFromString(curi, newUrl, "", Link.NAVLINK_HOP);
                }
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Save the link for later processing
    private void addLinkFromString(CrawlURI curi, String uri,
            CharSequence context, char hopType) {
        try {
            curi.createAndAddLinkRelativeToBase(uri, context.toString(),
                    hopType);
        } catch (URIException e) {
            if (getController() != null) {
                getController().logUriError(e, curi.getUURI(), uri);
            } else {
                logger.info("failed createAndAddLinkRelativeToBase "
                        + curi + ", " + uri + ", " + context + ", "
                        + hopType + ": " + e);
            }
        }
    }
}

Note: the extract(CrawlURI) method is never called directly by your own code; the innerProcess method of the Extractor base class, shown earlier, invokes it for each URI after it has been fetched. What the extract method itself does can be summarized in two steps.

(1) The first step is to obtain the HTML response that the Fetcher has already downloaded for the current link and convert it into a string, so that the links contained in the page can be processed.
(2) The second step retrieves all link candidates from the page content with the regular expression and checks whether each one matches the Sohu news format. If it does, the addLinkFromString() method is called to add the link to a queue for later processing; a small standalone illustration of this step follows.
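
The snippet below (not part of the extractor; the HTML fragment and class name ExtractStepDemo are made up) applies the same two regular expressions to a short page fragment and prints the news links that would be queued:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractStepDemo {

    static final String PATTERN_SOHU_NEWS =
            "http://news\\.sohu\\.com/[\\d]+/n[\\d]+\\.shtml";
    static final String PATTERN_A_HREF =
            "<a\\s+href\\s*=\\s*(\"([^\"]*)\"|[^\\s>])\\s*>";

    public static void main(String[] args) {
        // A made-up fragment standing in for the fetched page content
        String content =
                "<a href=\"http://news.sohu.com/20061122/n246553333.shtml\">news</a>"
                + "<a href=\"http://www.sohu.com/about.shtml\">about</a>";

        Matcher matcher = Pattern.compile(PATTERN_A_HREF,
                Pattern.CASE_INSENSITIVE).matcher(content);
        while (matcher.find()) {
            String newUrl = matcher.group(2);
            if (newUrl != null && newUrl.matches(PATTERN_SOHU_NEWS)) {
                // In the real extractor this is where addLinkFromString() is called
                System.out.println("would queue: " + newUrl);
            }
        }
    }
}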

After the Extractor class has been developed, it has to be registered so that it appears in the drop-down list when Heritrix is started from the WebUI. To do this, modify the Processor.options file under the modules directory of the Eclipse project.

Opening the Processor.options file shows that it holds all the entries displayed in the drop-down list when the processor chain is configured in the WebUI. To add the SohuNewsExtractor developed above, you only need to add one row at the appropriate position, with the following content:

my.SohuNewsExtractor|SohuNewsExtractor
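
For orientation, the relevant part of Processor.options then looks roughly as follows; the surrounding entries are only illustrative, and the last line is the one actually added:

org.archive.crawler.extractor.ExtractorHTTP|ExtractorHTTP
org.archive.crawler.extractor.ExtractorHTML|ExtractorHTML
my.SohuNewsExtractor|SohuNewsExtractor
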
Next, start Heritrix again, create a task, and go to the processor chain settings page; the newly developed Extractor now appears in the list.

Click "Add" to add it to the processor chain.

Note that it must be placed after ExtractorHTTP so that Heritrix can process the HTTP-protocol-related content first. In the same way that this custom Extractor was added, developers can customize the other kinds of processors as well: simply find the corresponding .options file in the modules directory and add the fully qualified name of the new class.
