HTML Parser: Parsing a Table with jericho-html-3.3

Source: Internet
Author: User

The first part of this post is based on other blogs I found online. That was a long time ago and I have forgotten which ones I referenced, so my apologies to their authors.

First, here is the HTML page:


For this page, I want to extract the text content of every TD. How can that be done? With regular expressions, I found it hard to write a pattern that parses out the results I want.

I searched the web and found jericho-html-3.3, a Java library for parsing HTML, and used it to parse the table. It is very convenient indeed.

The code is as follows:

package com.xxx.hbuassys.test;

import java.net.URL;
import java.util.Iterator;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;

public class HtmlParser {
    public static void main(String[] args) throws Exception {
        String sourceUrlString = "test.html";
        // Treat a bare file name as a file: URL.
        if (sourceUrlString.indexOf(':') == -1) sourceUrlString = "file:" + sourceUrlString;
        Source source = new Source(new URL(sourceUrlString));
        List<Element> elements_table = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested inside each other and we need the second one, so delete the first.
        elements_table.remove(0);
        Iterator<Element> it_table = elements_table.iterator();
        while (it_table.hasNext()) {
            Element element_table = it_table.next();
            // System.out.println("**" + element_table.toString() + "\n**");
            Segment getcontent_table = element_table.getContent();
            List<Element> elements_tr = getcontent_table.getAllElements(HTMLElementName.TR);
            Iterator<Element> it_tr = elements_tr.iterator();
            while (it_tr.hasNext()) {
                Element element_tr = it_tr.next();
                Segment getcontent_tr = element_tr.getContent();
                // Collect the FONT elements inside this row.
                List<Element> elements_font = getcontent_tr.getAllElements(HTMLElementName.FONT);
                Iterator<Element> it_font = elements_font.iterator();
                int i = 1;
                while (it_font.hasNext()) {
                    Element element_font = it_font.next();
                    // Print the element's text with all markup stripped.
                    System.out.println(i + "=" + element_font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}

Results:

1 = want to learn Name
2 = Result
3 = time
4 = Synopsis


1 = 9 Want to learn
2 = +fail want to learn
3 = +fail
4 = 12:31
5 = want to learn


1 = 1 cdrouter_basic_1
2 = Pass wants to learn
3 = 00:00
4 = want to learn


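The general idea is to get all the TABLE elements first, pick the table we need, take the TR elements out of it, and then take the cells out of each TR to get the text we want. (The code above selects the FONT elements inside each row rather than the TD elements themselves.)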

If that were all, this post would be no different from what other people on the web have already written.

While using this library in a project, however, I ran into a problem:

If the HTML page is encoded as UTF-8, the parsed content comes out garbled. Re-encoding the garbled strings directly, for example with new String(str.getBytes(), "GBK") and similar calls, does not solve the problem. I have tested this myself.
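A minimal sketch of why that kind of re-encoding usually cannot help. It assumes, purely for illustration, GBK bytes that were wrongly decoded as UTF-8; the post does not say which charsets were actually involved, and the class and method names are hypothetical. Once a decoder has replaced invalid byte sequences with U+FFFD, the original bytes are gone, so calling getBytes and new String again cannot bring the text back. (The trick only works when the wrong decoding was lossless, as with ISO-8859-1.) Charsets are passed explicitly here, whereas the call quoted above relies on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTripDemo {
    /** Decodes raw bytes with the wrong charset, then attempts the getBytes/new String "repair". */
    static String decodeWronglyThenRepair(byte[] raw, Charset wrong, Charset right) {
        String garbled = new String(raw, wrong);            // invalid sequences become U+FFFD here
        return new String(garbled.getBytes(wrong), right);  // re-encodes the replacement chars, not the original bytes
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");  // assumption: the JDK ships the GBK charset
        String original = "\u4e2d\u6587";      // two Chinese characters
        byte[] raw = original.getBytes(gbk);

        String repaired = decodeWronglyThenRepair(raw, StandardCharsets.UTF_8, gbk);
        System.out.println(original.equals(repaired));              // prints "false"
        System.out.println(original.equals(new String(raw, gbk)));  // prints "true"
    }
}
```

The reliable fix is to decode the raw bytes with the right charset in the first place, not to patch the string afterwards.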

For example, the HTML page becomes:


The output is then:

1 =???

? Name
2 = Result
3 = time
4 = Synopsis


1 = 9???

?
2 = +fail?

???
3 = +fail
4 = 12:31
5 =?

?

??


1 = 1 cdrouter_basic_1
2 = Pass??

??
3 = 00:00
4 =?

?

??




The fix I used is to change <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> to <meta http-equiv="Content-Type" content="text/html;charset=gbk"> in the HTML before parsing it.
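A literal string replacement of the meta tag only takes effect when the tag in the page matches character for character, including case and spacing. As a hedged alternative, here is a small sketch that rewrites the charset value with a case-insensitive regular expression. The class name, method name, and pattern are my own assumptions for illustration and are not part of Jericho.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetRewriter {
    // Matches "charset=<name>" inside a <meta ...> tag, ignoring case.
    // The part before the charset name is captured so it can be kept unchanged.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?<prefix><meta\\b[^>]*?charset\\s*=\\s*[\"']?)[\\w.:-]+",
            Pattern.CASE_INSENSITIVE);

    /** Returns html with the charset declared in any meta tag replaced by newCharset. */
    static String rewriteMetaCharset(String html, String newCharset) {
        Matcher m = META_CHARSET.matcher(html);
        return m.replaceAll("${prefix}" + Matcher.quoteReplacement(newCharset));
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">";
        System.out.println(rewriteMetaCharset(html, "gbk"));
        // prints: <meta http-equiv="Content-Type" content="text/html;charset=gbk">
    }
}
```

The regular expression is applied only to this one declaration, not to the table itself, so it stays simple. Text outside a meta tag that happens to contain "charset=" is left alone.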

For details, the reference code is as follows:

package com.xxx.hbuassys.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.Iterator;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;

public class HtmlParser {
    public static void main(String[] args) throws Exception {
        // Read the whole file into a string (decoded with the platform default charset).
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File("test.html"))));
        // BufferedReader reader = new BufferedReader(new FileReader("test.html"));
        StringBuilder sbf = new StringBuilder();
        String str = null;
        while ((str = reader.readLine()) != null) {
            sbf.append(str).append("\n");
        }
        reader.close();
        // Fix for the garbled Chinese text: declare GBK instead of UTF-8 in the meta tag.
        // String.replace is literal and case-sensitive, so the first argument must match the tag in the page exactly.
        String html = sbf.toString().replace(
                "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">",
                "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=gbk\">");
        // System.out.println(html);
        Source source = new Source(html);
        List<Element> elements_table = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested inside each other and we need the second one, so delete the first.
        elements_table.remove(0);
        Iterator<Element> it_table = elements_table.iterator();
        while (it_table.hasNext()) {
            Element element_table = it_table.next();
            // System.out.println("**" + element_table.toString() + "\n**");
            Segment getcontent_table = element_table.getContent();
            List<Element> elements_tr = getcontent_table.getAllElements(HTMLElementName.TR);
            Iterator<Element> it_tr = elements_tr.iterator();
            while (it_tr.hasNext()) {
                Element element_tr = it_tr.next();
                Segment getcontent_tr = element_tr.getContent();
                // Collect the FONT elements inside this row.
                List<Element> elements_font = getcontent_tr.getAllElements(HTMLElementName.FONT);
                Iterator<Element> it_font = elements_font.iterator();
                int i = 1;
                while (it_font.hasNext()) {
                    Element element_font = it_font.next();
                    // Print the element's text with all markup stripped.
                    System.out.println(i + "=" + element_font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}
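One detail of the reference code is worth noting: the InputStreamReader is created without a charset, so the file is decoded with the JVM's default charset, which differs between machines. The sketch below shows the same read loop with the charset stated explicitly. It assumes, for illustration only, that the file is GBK-encoded (the post does not state the file's real encoding), and the class and method names are hypothetical. The resulting string could then be passed to new Source(html) as in the code above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetReader {
    /** Reads the whole file as text, decoding it with the given charset. */
    static String readAll(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /** Writes text to a temporary file in the given charset and checks that readAll returns it unchanged. */
    static boolean roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("page", ".html");
            try {
                Files.write(tmp, text.getBytes(charset));
                return readAll(tmp, charset).equals(text);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");  // assumption: the file's real encoding is GBK
        System.out.println(roundTrip("<td>\u4e2d\u6587</td>\n", gbk));  // prints "true"
    }
}
```

Unlike InputStreamReader, Files.newBufferedReader throws an exception on bytes that are invalid in the given charset, so a wrong guess about the encoding is reported and does not silently produce garbled text.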

The results are as follows:

1 = want to learn Name
2 = Result
3 = time
4 = Synopsis


1 = 9 Want to learn
2 = +fail want to learn
3 = +fail
4 = 12:31
5 = want to learn


1 = 1 cdrouter_basic_1
2 = Pass wants to learn
3 = 00:00
4 = want to learn




Copyright notice: This is an original article by the blog's author and may not be reproduced without consent.
