HtmlExtractor: a Java component for precise, template-based extraction of structured information from web pages

Last Update:2014-08-31 Source: Internet

Author: User

HtmlExtractor is a Java component for template-based extraction of structured information from web pages. It does not include a crawler, but crawlers or other programs can call it to extract the structured information of a web page more precisely.

HtmlExtractor is designed for large-scale distributed environments and adopts a master-slave architecture. The master node maintains the extraction rules, and the slave nodes request the rules from the master node. When the rules change, the master node actively notifies the slave nodes, so rule changes take effect in real time.

How do I use it?

HtmlExtractor consists of two subprojects, html-extractor and html-extractor-web. html-extractor implements the data extraction logic and acts as the slave node; html-extractor-web provides the web interface for maintaining the extraction rules and acts as the master node. html-extractor is a jar package that can be referenced through Maven:

    <dependency>
        <groupId>org.apdplat</groupId>
        <artifactId>html-extractor</artifactId>
        <version>1.0</version>
    </dependency>

html-extractor-web is a war package that needs to be deployed to a Servlet/JSP container.

Single-machine (centralized) usage:

    // 1. Build the extraction rules
    List<UrlPattern> urlPatterns = new ArrayList<>();
    // 1.1 Build the URL pattern (the article id in the sample URL is uppercase, hence A-Z)
    UrlPattern urlPattern = new UrlPattern();
    urlPattern.setUrlPattern("http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html");
    // 1.2 Build the HTML template
    HtmlTemplate htmlTemplate = new HtmlTemplate();
    htmlTemplate.setTemplateName("NetEase Finance Channel");
    htmlTemplate.setTableName("finance");
    // 1.3 Associate the HTML template with the URL pattern
    urlPattern.addHtmlTemplate(htmlTemplate);
    // 1.4 Build a CSS path for the title
    CssPath cssPath = new CssPath();
    cssPath.setCssPath("h1");
    cssPath.setFieldName("title");
    cssPath.setFieldDescription("caption");
    // 1.5 Associate the CSS path with the template
    htmlTemplate.addCssPath(cssPath);
    // 1.6 Build a CSS path for the body
    cssPath = new CssPath();
    cssPath.setCssPath("div#endText");
    cssPath.setFieldName("content");
    cssPath.setFieldDescription("body");
    // 1.7 Associate the CSS path with the template
    htmlTemplate.addCssPath(cssPath);
    // More URL patterns can be built in the same way
    urlPatterns.add(urlPattern);

    // 2. Get the extraction rule object
    ExtractRegular extractRegular = ExtractRegular.getInstance(urlPatterns);
    // Note: the extraction rules can be changed dynamically with these 3 methods
    // extractRegular.addUrlPatterns(urlPatterns);
    // extractRegular.addUrlPattern(urlPattern);
    // extractRegular.removeUrlPattern(urlPattern.getUrlPattern());

    // 3. Get the HTML extraction tool
    HtmlExtractor htmlExtractor = HtmlExtractor.getInstance(extractRegular);

    // 4. Extract the web page
    String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
    List<ExtractResult> extractResults = htmlExtractor.extract(url, "gb2312");

    // 5. Print the results
    int i = 1;
    for (ExtractResult extractResult : extractResults) {
        System.out.println((i++) + ". Web page " + extractResult.getUrl() + " extraction results");
        for (ExtractResultItem extractResultItem : extractResult.getExtractResultItems()) {
            System.out.print("\t" + extractResultItem.getField() + " = " + extractResultItem.getValue());
        }
        System.out.println("\tdescription = " + extractResult.getDescription());
        System.out.println("\tkeywords = " + extractResult.getKeywords());
    }
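To illustrate how a URL pattern selects which templates apply to a page, here is a minimal standalone sketch using only java.util.regex. The class RuleLookupSketch and its nested Rule type are hypothetical stand-ins and are not part of HtmlExtractor; whether the library matches the whole URL the way this sketch does is also an assumption. Note that the article id in the sample URL is uppercase, so the character class has to include A-Z for the pattern to match.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the rule model; not HtmlExtractor's real classes.
class RuleLookupSketch {
    // One rule: a compiled URL regex plus field name -> CSS path pairs.
    static final class Rule {
        final Pattern urlPattern;
        final Map<String, String> fieldToCssPath = new LinkedHashMap<>();

        Rule(String urlRegex) {
            this.urlPattern = Pattern.compile(urlRegex);
        }

        // Register a field and return this rule so calls can be chained.
        Rule field(String fieldName, String cssPath) {
            fieldToCssPath.put(fieldName, cssPath);
            return this;
        }
    }

    // Return every rule whose pattern matches the whole URL.
    static List<Rule> rulesFor(String url, List<Rule> rules) {
        List<Rule> matched = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).matches()) {
                matched.add(rule);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html")
                .field("title", "h1")
                .field("content", "div#endText"));

        String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
        for (Rule rule : rulesFor(url, rules)) {
            // Print the field -> CSS path pairs that would be applied to this page.
            System.out.println(rule.fieldToCssPath);
        }
    }
}
```

A URL that matches no pattern yields an empty list, which in a crawler would simply mean "no template for this page".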

Multi-machine (distributed) usage:

1. Run the master node, which maintains the extraction rules: package the html-extractor-web subproject as a war and deploy it to Tomcat.

2. Get an HtmlExtractor instance (the slave node). Sample code:

    String allExtractRegularUrl = "http://localhost:8080/HtmlExtractorServer/api/all_extract_regular.jsp";
    String redisHost = "localhost";
    int redisPort = 6379;
    HtmlExtractor htmlExtractor = HtmlExtractor.getInstance(allExtractRegularUrl, redisHost, redisPort);

3. Extract the information. Sample code:

    String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
    List<ExtractResult> extractResults = htmlExtractor.extract(url, "gb2312");
    int i = 1;
    for (ExtractResult extractResult : extractResults) {
        System.out.println((i++) + ". Web page " + extractResult.getUrl() + " extraction results");
        for (ExtractResultItem extractResultItem : extractResult.getExtractResultItems()) {
            System.out.print("\t" + extractResultItem.getField() + " = " + extractResultItem.getValue());
        }
        System.out.println("\tdescription = " + extractResult.getDescription());
        System.out.println("\tkeywords = " + extractResult.getKeywords());
    }
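The master-slave rule propagation described earlier (the slave pulls all rules from the master, and the master notifies the slave when rules change) can be sketched in a single process as follows. This is only an illustration of the pattern: Master, Slave, and RuleChangeListener are hypothetical names, and plain method calls stand in for the network. In the real component the rules are fetched from the all_extract_regular.jsp URL, and the Redis host and port passed to getInstance suggest that Redis carries the change notifications, but that is an inference from the parameters rather than something the article states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// In-process illustration only; all names here are hypothetical.
class MasterSlaveSketch {
    // Callback a slave registers so the master can push change notices.
    interface RuleChangeListener {
        void rulesChanged();
    }

    // Master: owns the rules and notifies listeners on every change.
    static final class Master {
        private final List<String> urlPatterns = new ArrayList<>();
        private final List<RuleChangeListener> listeners = new CopyOnWriteArrayList<>();

        synchronized List<String> allRules() {
            return new ArrayList<>(urlPatterns); // defensive copy
        }

        void subscribe(RuleChangeListener listener) {
            listeners.add(listener);
        }

        void addRule(String urlPattern) {
            synchronized (this) {
                urlPatterns.add(urlPattern);
            }
            // Notify outside the lock so listeners can call allRules() freely.
            for (RuleChangeListener listener : listeners) {
                listener.rulesChanged();
            }
        }
    }

    // Slave: pulls all rules at startup and again after each notification.
    static final class Slave implements RuleChangeListener {
        private final Master master;
        private volatile List<String> cachedRules;

        Slave(Master master) {
            this.master = master;
            this.cachedRules = master.allRules();
            master.subscribe(this);
        }

        @Override
        public void rulesChanged() {
            cachedRules = master.allRules(); // full reload keeps the sketch simple
        }

        List<String> rules() {
            return cachedRules;
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave(master);
        System.out.println("rules before change: " + slave.rules().size());
        master.addRule("http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html");
        System.out.println("rules after change: " + slave.rules().size());
    }
}
```

Reloading the full rule set on every notification is the simplest choice for a sketch; a notification that carries only the changed rule would save bandwidth at the cost of more bookkeeping on the slave.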


This article is an English version of an article originally published in Chinese on aliyun.com.
