HTML Parser: Parsing a Table with jericho-html-3.3

Source: Internet
Author: User

Part of this content originally came from another blog, but it was long enough ago that I have forgotten whose it was. My apologies to the author.

Start with an HTML page:


From this page I want to extract the text content of every TD. With regular expressions I find it really difficult to write a pattern that parses out exactly the results I want.

Searching the internet turned up jericho-html-3.3, a library that makes parsing tables really convenient.

The code is as follows:

    package com.xxx.hbuassys.test;

    import java.net.URL;
    import java.util.List;

    import net.htmlparser.jericho.Element;
    import net.htmlparser.jericho.HTMLElementName;
    import net.htmlparser.jericho.Segment;
    import net.htmlparser.jericho.Source;

    public class HtmlParser {
        public static void main(String[] args) throws Exception {
            String sourceUrlString = "test.html";
            // Treat a bare file name as a file: URL
            if (sourceUrlString.indexOf(':') == -1) sourceUrlString = "file:" + sourceUrlString;
            Source source = new Source(new URL(sourceUrlString));
            List<Element> tables = source.getAllElements(HTMLElementName.TABLE);
            tables.remove(0); // The tables are nested and we need the second one, so drop the first
            for (Element table : tables) {
                // System.out.println("**" + table.toString() + "\n**");
                Segment tableContent = table.getContent();
                // Walk every row of the table
                List<Element> rows = tableContent.getAllElements(HTMLElementName.TR);
                for (Element row : rows) {
                    Segment rowContent = row.getContent();
                    // The cell text is read from the FONT elements inside the row
                    List<Element> fonts = rowContent.getAllElements(HTMLElementName.FONT);
                    int i = 1;
                    for (Element font : fonts) {
                        // getTextExtractor() strips the markup and returns plain text
                        System.out.println(i + " = " + font.getContent().getTextExtractor().toString());
                        i++;
                    }
                    System.out.println();
                }
            }
        }
    }

Results:

1 = want to learn Name
2 = Result
3 = time
4 = Synopsis


1 = 9 Want to learn
2 = +fail want to learn
3 = +fail
4 = 12:31
5 = want to learn


1 = 1 cdrouter_basic_1
2 = Pass wants to learn
3 = 00:00
4 = want to learn


The general idea is to collect all the TABLE tags first, pick the table we need, take out the TR elements inside it, and then take the cell content out of each TR (the code above reads the FONT elements inside each row) to get what we need.

If that were all, this post would be no different from what others have already written online.

While using this library for a project, however, I ran into a problem:

If the HTML page's encoding is UTF-8, the parsed content comes out garbled. Re-encoding the garbled strings afterwards, for example with new String(str.getBytes(), "GBK"), does not solve the problem; I tested this personally.

For example, the HTML page becomes:


The results are:

1 =???? Name
2 = Result
3 = time
4 = Synopsis


1 = 9????
2 = +fail????
3 = +fail
4 = 12:31
5 =????


1 = 1 cdrouter_basic_1
2 = Pass????
3 = 00:00
4 =????
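
Why re-encoding afterwards cannot help: once bytes have been decoded with the wrong charset, byte sequences that are invalid in that charset are replaced and the original bytes are lost. A minimal sketch using only the JDK, assuming for illustration that GBK-encoded bytes were decoded as UTF-8. The sample text and the class name MojibakeDemo are made up and do not come from the original page:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /** Decodes bytes with the given charset; malformed input is replaced. */
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** The attempted repair: re-encode the garbled text, then decode as GBK. */
    public static String attemptRepair(String garbled, Charset wrongCharset, Charset gbk) {
        return new String(garbled.getBytes(wrongCharset), gbk);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Four Chinese characters, written as escapes to keep the source ASCII-only
        String original = "\u6d4b\u8bd5\u7ed3\u679c";
        byte[] gbkBytes = original.getBytes(gbk);

        // Wrong decode: most GBK byte pairs are not valid UTF-8 and get replaced
        String garbled = decode(gbkBytes, StandardCharsets.UTF_8);
        String repaired = attemptRepair(garbled, StandardCharsets.UTF_8, gbk);
        // Right decode: use the charset the bytes were actually written in
        String correct = decode(gbkBytes, gbk);

        System.out.println("repaired equals original: " + repaired.equals(original));
        System.out.println("correct equals original:  " + correct.equals(original));
    }
}
```

The point of the sketch is that the charset has to be right at the moment the bytes are first decoded; nothing applied to the resulting string can bring the lost bytes back.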


The fix I used is to change <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> to <meta http-equiv="Content-Type" content="text/html;charset=gbk">

The full code is as follows:
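
An exact string replacement of the meta tag only works when the tag is spelled exactly as expected (same case, same spacing, same quoting). A minimal sketch of a more tolerant rewrite, using only java.util.regex. The class and method names (MetaCharsetRewriter, rewriteCharset) are hypothetical and are not part of Jericho:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetRewriter {

    // Matches "charset=<name>" inside a <meta ...> tag, ignoring case and
    // optional whitespace or an opening quote around the value.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?<prefix><meta\\b[^>]*?charset\\s*=\\s*[\"']?)[\\w-]+",
            Pattern.CASE_INSENSITIVE);

    /** Replaces the charset name declared in every meta tag of the page. */
    public static String rewriteCharset(String html, String newCharset) {
        Matcher m = META_CHARSET.matcher(html);
        // Keep everything up to the charset value, then substitute the new name
        return m.replaceAll("${prefix}" + Matcher.quoteReplacement(newCharset));
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">";
        System.out.println(rewriteCharset(tag, "gbk"));
    }
}
```

This is a regular expression applied to one well-known tag, not a general HTML rewrite, so it is best kept to this narrow job.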

    package com.xxx.hbuassys.test;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.List;

    import net.htmlparser.jericho.Element;
    import net.htmlparser.jericho.HTMLElementName;
    import net.htmlparser.jericho.Segment;
    import net.htmlparser.jericho.Source;

    public class HtmlParser {
        public static void main(String[] args) throws Exception {
            // Read the whole page into a string first
            StringBuilder sbf = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(new File("test.html"))))) {
                // BufferedReader reader = new BufferedReader(new FileReader("test.html"));
                String str = null;
                while ((str = reader.readLine()) != null) {
                    sbf.append(str).append("\n");
                }
            }
            // Fix for the garbled Chinese text: rewrite the declared charset
            String html = sbf.toString().replace(
                    "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">",
                    "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=gbk\">");
            // System.out.println(html);
            Source source = new Source(html);
            List<Element> tables = source.getAllElements(HTMLElementName.TABLE);
            tables.remove(0); // The tables are nested and we need the second one, so drop the first
            for (Element table : tables) {
                // System.out.println("**" + table.toString() + "\n**");
                Segment tableContent = table.getContent();
                List<Element> rows = tableContent.getAllElements(HTMLElementName.TR);
                for (Element row : rows) {
                    Segment rowContent = row.getContent();
                    List<Element> fonts = rowContent.getAllElements(HTMLElementName.FONT);
                    int i = 1;
                    for (Element font : fonts) {
                        System.out.println(i + " = " + font.getContent().getTextExtractor().toString());
                        i++;
                    }
                    System.out.println();
                }
            }
        }
    }

The results are as follows:

1 = want to learn Name
2 = Result
3 = time
4 = Synopsis


1 = 9 Want to learn
2 = +fail want to learn
3 = +fail
4 = 12:31
5 = want to learn


1 = 1 cdrouter_basic_1
2 = Pass wants to learn
3 = 00:00
4 = want to learn
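
One thing to keep in mind about the listing above: an InputStreamReader created without a charset argument decodes with the JVM's default charset, so the same code can behave differently from one machine to another. A sketch of reading the file with an explicit charset instead, using only the JDK and assuming the file's bytes really are GBK. ExplicitCharsetReader and readHtml are hypothetical names chosen for this example:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetReader {

    /** Reads the whole file and decodes it with the charset given by the caller. */
    public static String readHtml(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        Charset gbk = Charset.forName("GBK");
        // Write a small GBK-encoded sample page to a temporary file
        Path tmp = Files.createTempFile("test", ".html");
        try {
            String page = "<td><font>\u6d4b\u8bd5</font></td>";
            Files.write(tmp, page.getBytes(gbk));
            // Decoding with the same charset gives back the original text
            String html = readHtml(tmp, gbk);
            System.out.println("round trip ok: " + html.equals(page));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The string returned by readHtml could then be handed to new Source(html) in the same way as in the listing above.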



