HTML Parser--jericho-html-3.3 Parse table

Last Update:2014-06-15 Source: Internet

Author: User


Part of this content comes from another blog I found online, but it was some time ago and I have forgotten whose it was. My apologies to the original author.

First, here is the HTML page:


From this page I want to extract the text content of every TD. Doing this with regular expressions is hard to get right, and I could not get them to produce the results I wanted.
After searching online I found the jericho-html-3.3 library, which turns out to be very convenient for parsing tables.
The code is as follows:
package com.xxx.hbuassys.test;

import java.net.URL;
import java.util.Iterator;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;

public class HtmlParser {
    public static void main(String[] args) throws Exception {
        String sourceUrlString = "test.html";
        if (sourceUrlString.indexOf(':') == -1) sourceUrlString = "file:" + sourceUrlString;
        Source source = new Source(new URL(sourceUrlString));
        List elements_table = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested and we need the second one, so remove the first
        elements_table.remove(0);
        Iterator it_table = elements_table.iterator();
        while (it_table.hasNext()) {
            Element element_table = (Element) it_table.next();
            // System.out.println("**" + element_table.toString() + "\n**");
            Segment getContent_table = (Segment) element_table.getContent();
            List elements_tr = getContent_table.getAllElements(HTMLElementName.TR);
            Iterator it_tr = elements_tr.iterator();
            while (it_tr.hasNext()) {
                Element element_tr = (Element) it_tr.next();
                Segment getContent_tr = (Segment) element_tr.getContent();
                List elements_font = getContent_tr.getAllElements(HTMLElementName.FONT);
                Iterator it_font = elements_font.iterator();
                int i = 1;
                while (it_font.hasNext()) {
                    Element element_font = (Element) it_font.next();
                    Segment getContent_font = (Segment) element_font.getContent();
                    String a1 = getContent_font.toString();
                    System.out.println(i + " = " + element_font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}
Results:
1 = want to learn Name
2 = Result
3 = time
4 = Synopsis


1 = 9 Want to learn
2 = +fail want to learn
3 = +fail
4 = 12:31
5 = want to learn


1 = 1 cdrouter_basic_1
2 = Pass wants to learn
3 = 00:00
4 = want to learn


The general idea is to get all the TABLE elements first, pick the table that is needed, get the TR elements inside it, and then read the cell text from each TR (the code above reads the FONT elements inside each row) to get what we need.
If that were all, this would be no different from what others have already written online.
However, while using the library in a project, I ran into a problem:
If the HTML page is encoded as UTF-8, the parsed content comes out garbled. Re-encoding the garbled strings afterwards, for example with new String(str.getBytes(), "GBK"), does not solve the problem; I have tested this myself.
For example, the HTML page becomes:

The output is then:
1 =???? Name
2 = Result
3 = time
4 = Synopsis


1 = 9????
2 = +fail????
3 = +fail
4 = 12:31
5 =????


1 = 1 cdrouter_basic_1
2 = Pass????
3 = 00:00
4 =????
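
Why re-encoding the garbled strings cannot help can be seen with the standard library alone. The sketch below is an illustration under an assumption, not a diagnosis of the page above: it assumes the bytes on disk are GBK while they are decoded as UTF-8. The class name LossyDecodeDemo and the helper roundTripThroughUtf8 are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyDecodeDemo {
    // Decode the bytes as UTF-8, then encode the resulting string back to UTF-8.
    static byte[] roundTripThroughUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption for this demo: 0xD6 0xD0 is the GBK encoding of U+4E2D.
        // The same two bytes are not a valid UTF-8 sequence.
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0};

        // Decoding with the wrong charset substitutes U+FFFD for the bad bytes.
        String wronglyDecoded = new String(gbkBytes, StandardCharsets.UTF_8);
        boolean hasReplacement = wronglyDecoded.indexOf('\uFFFD') >= 0;

        // The original bytes cannot be rebuilt from the substituted string.
        boolean recovered = Arrays.equals(gbkBytes, roundTripThroughUtf8(gbkBytes));

        System.out.println("contains U+FFFD: " + hasReplacement);
        System.out.println("original bytes recovered: " + recovered);
    }
}
```

Once the decoder has substituted replacement characters, the original bytes are gone, so the charset has to be right at the moment the bytes are first decoded.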


The fix is to change <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> to <meta http-equiv="Content-Type" content="text/html;charset=gbk"> before parsing.
The full code is as follows:
package com.xxx.hbuassys.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Iterator;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;

public class HtmlParser {
    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File("test.html"))));
        // BufferedReader reader = new BufferedReader(new FileReader("test.html"));
        StringBuilder sbf = new StringBuilder();
        String str = null;
        while ((str = reader.readLine()) != null) {
            sbf.append(str).append("\n");
        }
        // Workaround for the garbled Chinese text
        String html = sbf.toString().replace("<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">", "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=gbk\">");
        // System.out.println(html);
        Source source = new Source(html);
        List elements_table = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested and we need the second one, so remove the first
        elements_table.remove(0);
        Iterator it_table = elements_table.iterator();
        while (it_table.hasNext()) {
            Element element_table = (Element) it_table.next();
            // System.out.println("**" + element_table.toString() + "\n**");
            Segment getContent_table = (Segment) element_table.getContent();
            List elements_tr = getContent_table.getAllElements(HTMLElementName.TR);
            Iterator it_tr = elements_tr.iterator();
            while (it_tr.hasNext()) {
                Element element_tr = (Element) it_tr.next();
                Segment getContent_tr = (Segment) element_tr.getContent();
                List elements_font = getContent_tr.getAllElements(HTMLElementName.FONT);
                Iterator it_font = elements_font.iterator();
                int i = 1;
                while (it_font.hasNext()) {
                    Element element_font = (Element) it_font.next();
                    Segment getContent_font = (Segment) element_font.getContent();
                    String a1 = getContent_font.toString();
                    System.out.println(i + " = " + element_font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}

The results are as follows:
1 = want to learn Name
2 = Result
3 = time
4 = Synopsis


1 = 9 Want to learn
2 = +fail want to learn
3 = +fail
4 = 12:31
5 = want to learn


1 = 1 cdrouter_basic_1
2 = Pass wants to learn
3 = 00:00
4 = want to learn
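
The meta tag replacement above only matches one exact spelling of the tag. As a variation, not the method used in this article, the sketch below swaps the exact string replace for a case-insensitive regular expression, and decodes the file bytes with an explicitly named charset instead of the platform default. It uses only the standard library; MetaCharsetRewriter, decode and rewriteMetaCharset are hypothetical names invented for this example, and whether either step is needed depends on how the page was actually saved.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetRewriter {
    // Group 1: everything from "<meta" up to and including "charset=" (and an optional quote).
    // Group 2: the charset name itself.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(<meta\\b[^>]*?charset\\s*=\\s*[\"']?)([\\w-]+)",
            Pattern.CASE_INSENSITIVE);

    // Decode raw file bytes with a charset chosen by the caller.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Replace the charset declared in any <meta> tag, keeping the rest of the tag intact.
    static String rewriteMetaCharset(String html, String newCharset) {
        Matcher m = META_CHARSET.matcher(html);
        return m.replaceAll("$1" + Matcher.quoteReplacement(newCharset));
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">";
        System.out.println(rewriteMetaCharset(tag, "GBK"));
    }
}
```

The rewritten string can then be passed to new Source(html) exactly as in the code above.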

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
