HTML Parser: jericho-html-3.3 Table Parsing

Last Update: 2015-10-14 Source: Internet

Author: User


The first part of this article came from other blogs on the internet. That was a long time ago and I have forgotten which one I referenced, so my apologies to the author.

Start with a sample HTML page, test.html, which contains nested tables.


From this page I want to extract the text content of every TD. How can that be done? With regular expressions, I found it hard to write a pattern that parses out exactly the results I want.
Searching the web, I found the jericho-html-3.3 library, which can parse the table. It is very convenient indeed.
The code is as follows:
package com.xxx.hbuassys.test;

import java.net.URL;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;

public class HtmlParser {
    public static void main(String[] args) throws Exception {
        String sourceUrlString = "test.html";
        // Treat a bare file name as a local file URL.
        if (sourceUrlString.indexOf(':') == -1) sourceUrlString = "file:" + sourceUrlString;
        Source source = new Source(new URL(sourceUrlString));
        List<Element> tables = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested. What we need is the second one, so delete the first.
        tables.remove(0);
        for (Element table : tables) {
            Segment tableContent = table.getContent();
            List<Element> rows = tableContent.getAllElements(HTMLElementName.TR);
            for (Element row : rows) {
                Segment rowContent = row.getContent();
                // The text of each cell is read from the FONT elements in the row.
                List<Element> fonts = rowContent.getAllElements(HTMLElementName.FONT);
                int i = 1;
                for (Element font : fonts) {
                    System.out.println(i + "=" + font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}
Results:
1 = want to learn Name
2 = Result
3 = time
4 = Synopsis


1 = 9 Want to learn
2 = +fail want to learn
3 = +fail
4 = 12:31
5 = want to learn


1 = 1 cdrouter_basic_1
2 = Pass wants to learn
3 = 00:00
4 = want to learn


The general idea is to take out all the TABLE elements first, pick the table you need, take the TR elements out of it, and then take the cell content out of each TR (the code above reads the FONT elements inside each row).
If that were all, this would be no different from what other people on the web have already written.
While using this library on a project, however, I ran into a problem:
If the HTML page is encoded as UTF-8, the parsed content comes out garbled. Re-encoding the garbled strings directly, for example with new String(str.getBytes(), "GBK"), does not solve the problem. I have tested this myself.
For example, after the HTML page is changed to UTF-8, the results are:
1 =???
? Name
2 = Result
3 = time
4 = Synopsis


1 = 9???
?
2 = +fail?
???
3 = +fail
4 = 12:31
5 =?
?
??


1 = 1 cdrouter_basic_1
2 = Pass??
??
3 = 00:00
4 =?
?
??
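The question marks above are consistent with text that was decoded with the wrong charset somewhere along the way, which would explain why a later new String(str.getBytes(), "GBK") does not bring the characters back. The following is a minimal sketch of that effect using only the JDK. It is not the author's code and does not use Jericho; the class name LossyDecodeDemo and the choice of US-ASCII as the "wrong" charset are assumptions made purely for illustration.

```java
import java.nio.charset.StandardCharsets;

public class LossyDecodeDemo {
    /** Decodes bytes with the wrong charset, then tries to undo it by re-encoding. */
    public static String decodeWronglyThenRepair(byte[] utf8Bytes) {
        // Wrong decode: every byte >= 0x80 is not ASCII and becomes U+FFFD.
        String garbled = new String(utf8Bytes, StandardCharsets.US_ASCII);
        // Attempted repair: U+FFFD cannot be encoded as ASCII, so each one is written as '?'.
        byte[] reencoded = garbled.getBytes(StandardCharsets.US_ASCII);
        return new String(reencoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters, three UTF-8 bytes each
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        String repaired = decodeWronglyThenRepair(utf8Bytes);
        System.out.println(repaired.equals(original)); // false
        System.out.println(repaired);                  // ??????
        // Decoding the untouched bytes with the right charset still works.
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

Once the original bytes have been replaced during decoding, no later conversion of the string can restore them, so the charset has to be right at the point where the bytes are first turned into characters.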



The fix I used is to change <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> to <meta http-equiv="Content-Type" content="text/html;charset=gbk"> in the page text before parsing it.
The full reference code is as follows:
package com.xxx.hbuassys.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;

public class HtmlParser {
    public static void main(String[] args) throws Exception {
        // Read the whole page into a string, using the platform default charset.
        StringBuilder sbf = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File("test.html"))))) {
            String str;
            while ((str = reader.readLine()) != null) {
                sbf.append(str).append("\n");
            }
        }
        // Fix for the garbled Chinese text: rewrite the declared charset before parsing.
        // The search string must match the page's meta tag exactly.
        String html = sbf.toString().replace(
                "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">",
                "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=gbk\">");
        Source source = new Source(html);
        List<Element> tables = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested. What we need is the second one, so delete the first.
        tables.remove(0);
        for (Element table : tables) {
            Segment tableContent = table.getContent();
            List<Element> rows = tableContent.getAllElements(HTMLElementName.TR);
            for (Element row : rows) {
                Segment rowContent = row.getContent();
                List<Element> fonts = rowContent.getAllElements(HTMLElementName.FONT);
                int i = 1;
                for (Element font : fonts) {
                    System.out.println(i + "=" + font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}

The results are as follows:
1 = want to learn Name
2 = Result
3 = time
4 = Synopsis


1 = 9 Want to learn
2 = +fail want to learn
3 = +fail
4 = 12:31
5 = want to learn


1 = 1 cdrouter_basic_1
2 = Pass wants to learn
3 = 00:00
4 = want to learn




 
Copyright notice: This is an original article by the blogger and may not be reproduced without consent.
 

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
