HtmlExtractor: a Java component for precise, template-based extraction of structured information from web pages

Source: Internet
Author: User

HtmlExtractor is a template-based component, implemented in Java, for extracting structured information from web pages. It does not include a crawler, but crawlers and other programs can call it to extract a page's structured information more precisely.

HtmlExtractor is designed for large-scale distributed environments and adopts a master-slave architecture. The master node maintains the extraction rules, and the slave nodes request the rules from the master. When the rules change, the master actively notifies the slaves, so rule changes take effect in real time.

How do I use it?
HtmlExtractor consists of two subprojects, html-extractor and html-extractor-web. html-extractor implements the data extraction logic and acts as the slave node; html-extractor-web provides the web interface for maintaining the extraction rules and acts as the master node. html-extractor is a jar package that can be referenced through Maven:

    <dependency>
        <groupId>org.apdplat</groupId>
        <artifactId>html-extractor</artifactId>
        <version>1.0</version>
    </dependency>

html-extractor-web is a war package that must be deployed to a Servlet/JSP container.

Single-machine (centralized) usage:

    // 1. Construct the extraction rules
    List<UrlPattern> urlPatterns = new ArrayList<>();
    // 1.1 Construct the URL pattern
    UrlPattern urlPattern = new UrlPattern();
    urlPattern.setUrlPattern("http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html");
    // 1.2 Construct the HTML template
    HtmlTemplate htmlTemplate = new HtmlTemplate();
    htmlTemplate.setTemplateName("NetEase Finance Channel");
    htmlTemplate.setTableName("finance");
    // 1.3 Associate the URL pattern with the HTML template
    urlPattern.addHtmlTemplate(htmlTemplate);
    // 1.4 Construct a CSS path
    CssPath cssPath = new CssPath();
    cssPath.setCssPath("h1");
    cssPath.setFieldName("title");
    cssPath.setFieldDescription("title");
    // 1.5 Associate the CSS path with the template
    htmlTemplate.addCssPath(cssPath);
    // 1.6 Construct another CSS path
    cssPath = new CssPath();
    cssPath.setCssPath("div#endText");
    cssPath.setFieldName("content");
    cssPath.setFieldDescription("body");
    // 1.7 Associate the CSS path with the template
    htmlTemplate.addCssPath(cssPath);
    // More URL patterns can be constructed in the same way
    urlPatterns.add(urlPattern);
    // 2. Get the extraction rule object
    ExtractRegular extractRegular = ExtractRegular.getInstance(urlPatterns);
    // Note: the extraction rules can be changed dynamically with these 3 methods
    // extractRegular.addUrlPatterns(urlPatterns);
    // extractRegular.addUrlPattern(urlPattern);
    // extractRegular.removeUrlPattern(urlPattern.getUrlPattern());
    // 3. Get the HTML extraction tool
    HtmlExtractor htmlExtractor = HtmlExtractor.getInstance(extractRegular);
    // 4. Extract the web page
    String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
    List<ExtractResult> extractResults = htmlExtractor.extract(url, "gb2312");
    // 5. Output the results
    int i = 1;
    for (ExtractResult extractResult : extractResults) {
        System.out.println((i++) + ". Page " + extractResult.getUrl() + " extraction results");
        for (ExtractResultItem extractResultItem : extractResult.getExtractResultItems()) {
            System.out.print("\t" + extractResultItem.getField() + " = " + extractResultItem.getValue());
        }
        System.out.println("\tdescription = " + extractResult.getDescription());
        System.out.println("\tkeywords = " + extractResult.getKeywords());
    }
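The rule model used above (a URL pattern that leads to a template, which in turn holds field-to-CSS-path mappings) can be illustrated with a small lookup. The following is only a minimal sketch using the JDK's java.util.regex; the class RuleLookupSketch and its methods are hypothetical names invented for illustration, not HtmlExtractor classes, and the library's real matching logic may differ. The character class is written as [0-9A-Z] so that it matches the upper-case ID in the sample URL.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration; these are NOT HtmlExtractor's own classes.
final class RuleLookupSketch {
    // URL regex -> (field name -> CSS path), kept in insertion order.
    private final Map<Pattern, Map<String, String>> rules = new LinkedHashMap<>();

    void addRule(String urlRegex, Map<String, String> fieldToCssPath) {
        rules.put(Pattern.compile(urlRegex), fieldToCssPath);
    }

    // Returns the field -> CSS path map of every rule whose regex matches the whole URL.
    List<Map<String, String>> lookup(String url) {
        List<Map<String, String>> matched = new ArrayList<>();
        for (Map.Entry<Pattern, Map<String, String>> rule : rules.entrySet()) {
            if (rule.getKey().matcher(url).matches()) {
                matched.add(rule.getValue());
            }
        }
        return matched;
    }

    // Builds one rule mirroring the NetEase finance example from the article.
    static RuleLookupSketch sample() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "h1");
        fields.put("content", "div#endText");
        RuleLookupSketch sketch = new RuleLookupSketch();
        sketch.addRule("http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html", fields);
        return sketch;
    }

    public static void main(String[] args) {
        RuleLookupSketch sketch = sample();
        // A URL covered by the rule, followed by one that is not.
        System.out.println(sketch.lookup("http://money.163.com/08/1219/16/4THR2TMP002533QK.html"));
        System.out.println(sketch.lookup("http://example.com/index.html"));
    }
}
```

A page whose URL matches no pattern yields an empty list, which is a reasonable way to think about why extraction only returns results for URLs that some rule covers.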

 

Multi-machine (distributed) usage:
1. Run the master node, which maintains the extraction rules: package the html-extractor-web subproject as a war and deploy it to Tomcat.
2. Get an HtmlExtractor instance (slave node). Sample code:

    String allExtractRegularUrl = "http://localhost:8080/HtmlExtractorServer/api/all_extract_regular.jsp";
    String redisHost = "localhost";
    int redisPort = 6379;
    HtmlExtractor htmlExtractor = HtmlExtractor.getInstance(allExtractRegularUrl, redisHost, redisPort);

3. Extract the information. Sample code:

    String url = "http://money.163.com/08/1219/16/4THR2TMP002533QK.html";
    List<ExtractResult> extractResults = htmlExtractor.extract(url, "gb2312");
    int i = 1;
    for (ExtractResult extractResult : extractResults) {
        System.out.println((i++) + ". Page " + extractResult.getUrl() + " extraction results");
        for (ExtractResultItem extractResultItem : extractResult.getExtractResultItems()) {
            System.out.print("\t" + extractResultItem.getField() + " = " + extractResultItem.getValue());
        }
        System.out.println("\tdescription = " + extractResult.getDescription());
        System.out.println("\tkeywords = " + extractResult.getKeywords());
    }
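The master-slave flow described earlier (slaves fetch all rules at startup, and the master notifies them when rules change) can be modelled in a few lines. This is a rough in-process sketch under stated assumptions: the slave-side sample passes a Redis host and port, which suggests that notifications travel over Redis, but that is an inference; here the network transport is replaced with direct method calls. RuleSyncSketch, Master, and Slave are hypothetical names, not HtmlExtractor classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical in-process model of master/slave rule propagation;
// not HtmlExtractor's actual classes or transport.
final class RuleSyncSketch {

    // Master node: owns the rules and notifies registered slaves on every change.
    static final class Master {
        private final List<String> urlPatterns = new CopyOnWriteArrayList<>();
        private final List<Slave> slaves = new CopyOnWriteArrayList<>();

        // Returns a snapshot copy of all current rules.
        List<String> allRules() {
            return new ArrayList<>(urlPatterns);
        }

        void register(Slave slave) {
            slaves.add(slave);
        }

        void addRule(String urlPattern) {
            urlPatterns.add(urlPattern);
            for (Slave slave : slaves) {
                slave.onRulesChanged(this); // active notification from the master
            }
        }
    }

    // Slave node: pulls all rules once at startup, then re-pulls when notified.
    static final class Slave {
        private volatile List<String> cachedRules;

        Slave(Master master) {
            cachedRules = master.allRules(); // initial request to the master
            master.register(this);
        }

        void onRulesChanged(Master master) {
            cachedRules = master.allRules();
        }

        List<String> rules() {
            return cachedRules;
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        master.addRule("http://money.163.com/\\d{2}/\\d{4}/\\d{2}/[0-9A-Z]{16}.html");
        Slave slave = new Slave(master);
        System.out.println("at startup: " + slave.rules());
        master.addRule("http://example.com/news/\\d+.html"); // hypothetical second rule
        System.out.println("after change: " + slave.rules());
    }
}
```

The sketch has each slave re-pull the full rule set when notified rather than receive a diff. That keeps the slave logic trivial, at the cost of transferring every rule on each change.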


