HTMLParser Usage in Detail (2): Node Content


2010-03-18 13:41
HTMLParser turns the parsed information into a tree structure. Node is the basic data type that holds this information.
Here is the definition of Node:
public interface Node extends Cloneable;

Node contains several groups of functions:
1. Functions for traversing the tree structure, which are the easiest to understand:
Node getParent(): gets the parent node
NodeList getChildren(): gets the list of child nodes
Node getFirstChild(): gets the first child node
Node getLastChild(): gets the last child node
Node getPreviousSibling(): gets the previous sibling node
Node getNextSibling(): gets the next sibling node
2. Functions for getting the node content:
String getText(): gets the text
String toPlainTextString(): gets the plain text
String toHtml(): gets the HTML (the original HTML)
String toHtml(boolean verbatim): gets the HTML (the original HTML)
String toString(): gets a string description of the node (its type, position, and text)
Page getPage(): gets the Page object that this node belongs to
int getStartPosition(): gets the start position of this node in the HTML page
int getEndPosition(): gets the end position of this node in the HTML page
3. Functions for filtering:
void collectInto(NodeList list, NodeFilter filter): applies the filter, and the nodes that match are added to list
4. Functions for visitor traversal:
void accept(NodeVisitor visitor): applies the visitor to this node
5. Functions for modifying content, which are used less often:
void setPage(Page page): sets the Page object that this node belongs to
void setText(String text): sets the text
void setChildren(NodeList children): sets the list of child nodes
6. Other functions:
void doSemanticAction(): performs the action associated with this node (only a few tags have one)
Object clone(): returns a copy of this node; it is declared here because Node extends Cloneable
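
To make the first group concrete, here is a minimal sketch of walking a node tree through parent, child, and sibling links. It deliberately does not use HTMLParser: SketchNode and TraversalSketch are hypothetical stand-ins that only borrow the method names listed above, so the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parsed node; this is not HTMLParser's Node. */
class SketchNode {
    final String text;
    private SketchNode parent;
    private final List<SketchNode> children = new ArrayList<>();

    SketchNode(String text, SketchNode... kids) {
        this.text = text;
        for (SketchNode kid : kids) {
            kid.parent = this;
            children.add(kid);
        }
    }

    SketchNode getParent() { return parent; }
    List<SketchNode> getChildren() { return children; }
    SketchNode getFirstChild() { return children.isEmpty() ? null : children.get(0); }
    SketchNode getLastChild() { return children.isEmpty() ? null : children.get(children.size() - 1); }
    SketchNode getPreviousSibling() { return sibling(-1); }
    SketchNode getNextSibling() { return sibling(+1); }

    // Looks up the neighbour at the given offset in the parent's child list.
    private SketchNode sibling(int offset) {
        if (parent == null) return null;
        int i = parent.children.indexOf(this) + offset;
        return (i >= 0 && i < parent.children.size()) ? parent.children.get(i) : null;
    }
}

class TraversalSketch {
    /** Depth-first walk that relies only on getFirstChild() and getNextSibling(). */
    static void walk(SketchNode first, int depth, StringBuilder out) {
        for (SketchNode n = first; n != null; n = n.getNextSibling()) {
            for (int i = 0; i < depth; i++) out.append("  ");
            out.append(n.text).append('\n');
            walk(n.getFirstChild(), depth + 1, out);
        }
    }

    /** Builds a small sample tree: html > body > (div > a, div). */
    static SketchNode sample() {
        return new SketchNode("html",
            new SketchNode("body",
                new SketchNode("div", new SketchNode("a")),
                new SketchNode("div")));
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        walk(sample(), 0, out);
        System.out.print(out);
    }
}
```

With the real library the same loop shape should apply to Node objects, although getChildren() there returns a NodeList rather than a java.util.List.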

In practice, what we do most with HTMLParser is process HTML pages. The filter or visitor functions are indispensable for that, and beyond them the first and second groups of functions are the most used. The first group is fairly easy to understand, so the example below illustrates the second group.
Here is the HTML file used for testing:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; ..."><title>Shirasawa Ju-www.baizeju.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="Top_main">
	<div id="Logoindex">
		<!--This is a comment-->
		Shirasawa Ju-www.baizeju.com
<a href="http://www.baizeju.com">Shirasawa Ju-www.baizeju.com</a>
	</div>
	Shirasawa Ju-www.baizeju.com
</div>
</body>
</html>

The test source code:
/**
* @author www.baizeju.com
*/
package com.baizeju.htmlparsertester;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Node;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.Parser;

/**
* @author www.baizeju.com
*/
public class Main {
    private static String ENCODE = "GBK";
    // Prints a message, re-encoding it from ENCODE to the platform default charset.
    private static void message(String szMsg) {
        try { System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding"))); } catch (Exception e) {}
    }
    // Reads a whole file into a string using ENCODE; returns "" on any error.
    public static String openFile(String szFileName) {
        try {
            BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream(new File(szFileName)), ENCODE));
            String szContent = "";
            String szTemp;
            while ((szTemp = bis.readLine()) != null) {
                szContent += szTemp + "\n";
            }
            bis.close();
            return szContent;
        }
        catch (Exception e) {
            return "";
        }
    }

    public static void main(String[] args) {
        try {
            Parser parser = new Parser((HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection());
            // Iterate over the top-level nodes and print the result of each content function.
            for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
                Node node = i.nextNode();
                message("getText:" + node.getText());
                message("getPlainText:" + node.toPlainTextString());
                message("toHtml:" + node.toHtml());
                message("toHtml(true):" + node.toHtml(true));
                message("toHtml(false):" + node.toHtml(false));
                message("toString:" + node.toString());
                message("=================================================");
            }
        }
        catch (Exception e) {
            System.out.println("Exception:" + e);
        }
    }
}

Output:
getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
getPlainText:
toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"; begins at : 0; ends at : 121
=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (121[0,121],123[1,0]): \n
=================================================
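getText:head
getPlainText:Shirasawa Ju-www.baizeju.com
toHtml:<head><meta http-equiv="Content-Type" content="text/html; ..."><title>Shirasawa Ju-www.baizeju.com</title></head>
toHtml(true):<head><meta http-equiv="Content-Type" content="text/html; ..."><title>Shirasawa Ju-www.baizeju.com</title></head>
toHtml(false):<head><meta http-equiv="Content-Type" content="text/html; ..."><title>Shirasawa Ju-www.baizeju.com</title></head>
toString:HEAD: Tag (123[1,0],129[1,6]): head
Tag (129[1,6],197[1,74]): meta http-equiv="Content-Type" content="text/html; ...
Tag (197[1,74],204[1,81]): title
Txt (204[1,81],223[1,100]): Shirasawa Ju-www.baizeju.com
End (223[1,100],231[1,108]): /title
End (231[1,108],238[1,115]): /head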

=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (238[1,115],240[2,0]): \n
=================================================
getText:html xmlns="http://www.w3.org/1999/xhtml"
getPlainText:
Shirasawa Ju-www.baizeju.com
Shirasawa Ju-www.baizeju.com
Shirasawa Ju-www.baizeju.com
toHtml:<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="Top_main">
	<div id="Logoindex">
		<!--This is a comment-->
		Shirasawa Ju-www.baizeju.com
<a href="http://www.baizeju.com">Shirasawa Ju-www.baizeju.com</a>
	</div>
	Shirasawa Ju-www.baizeju.com
</div>
</body>
</html>
toHtml(true):<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="Top_main">
	<div id="Logoindex">
		<!--This is a comment-->
		Shirasawa Ju-www.baizeju.com
<a href="http://www.baizeju.com">Shirasawa Ju-www.baizeju.com</a>
	</div>
	Shirasawa Ju-www.baizeju.com
</div>
</body>
</html>
toHtml(false):<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="Top_main">
	<div id="Logoindex">
		<!--This is a comment-->
		Shirasawa Ju-www.baizeju.com
<a href="http://www.baizeju.com">Shirasawa Ju-www.baizeju.com</a>
	</div>
	Shirasawa Ju-www.baizeju.com
</div>
</body>
</html>
toString:Tag (240[2,0],283[2,43]): html xmlns="http://www.w3.org/1999/xhtml"
Txt (283[2,43],285[3,0]): \n
Tag (285[3,0],292[3,7]): body
Txt (292[3,7],294[4,0]): \n
Tag (294[4,0],313[4,19]): div id="Top_main"
Txt (313[4,19],316[5,1]): \n\t
Tag (316[5,1],336[5,21]): div id="Logoindex"
Txt (336[5,21],340[6,2]): \n\t\t
Rem (340[6,2],351[6,13]): This is a comment
Txt (351[6,13],376[8,0]): \n\t\tShirasawa Ju-www.baizeju.com\n
Tag (376[8,0],409[8,33]): a href="http://www.baizeju.com"
Txt (409[8,33],428[8,52]): Shirasawa Ju-www.baizeju.com
End (428[8,52],432[8,56]): /a
Txt (432[8,56],435[9,1]): \n\t
End (435[9,1],441[9,7]): /div
Txt (441[9,7],465[11,0]): \n\tShirasawa Ju-www.baizeju.com\n
End (465[11,0],471[11,6]): /div
Txt (471[11,6],473[12,0]): \n
End (473[12,0],480[12,7]): /body
Txt (480[12,7],482[13,0]): \n
End (482[13,0],489[13,7]): /html

=================================================
The content of the first node corresponds to the first line, <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, which is easy to understand.
From this output you can also see that the content forms a tree structure, or rather a forest. The first-level tags in the page, such as DOCTYPE, head, and html, each form a top-level Node. (Many readers may find the content of the second and fourth nodes puzzling. These two nodes are just line breaks. HTMLParser turns all the line breaks, spaces, tabs, and so on in the page content into text nodes of their own, which is why such nodes appear. Their content is tiny but their level is high.)
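
The point about whitespace can be illustrated with a toy splitter, assuming a much simpler lexer than HTMLParser's real one: everything between < and > becomes a tag token, and everything else, including a lone line break, becomes a text token. ToySplitter and its split method are hypothetical names used only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer for illustration only; it is not how HTMLParser lexes a page. */
class ToySplitter {
    /** Splits html into "Tag:..." and "Txt:..." tokens; whitespace-only text is kept. */
    static List<String> split(String html) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.charAt(pos) == '<') {
                int close = html.indexOf('>', pos);
                if (close < 0) {
                    // Unterminated tag: treat the rest of the input as the tag text.
                    tokens.add("Tag:" + html.substring(pos + 1));
                    break;
                }
                tokens.add("Tag:" + html.substring(pos + 1, close));
                pos = close + 1;
            } else {
                int next = html.indexOf('<', pos);
                if (next < 0) next = html.length();
                tokens.add("Txt:" + escape(html.substring(pos, next)));
                pos = next;
            }
        }
        return tokens;
    }

    // Makes line breaks and tabs visible, similar to the toString output above.
    static String escape(String s) {
        return s.replace("\n", "\\n").replace("\t", "\\t");
    }

    public static void main(String[] args) {
        for (String token : split("<head></head>\n<html>\n\t<body>hi</body></html>")) {
            System.out.println(token);
        }
    }
}
```

The line break between the closing head tag and the html tag comes out as a text token of its own, which is the same effect as the second and fourth top-level nodes in the output.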
toPlainTextString() includes all the content that the user can see. Two points are interesting. One is that the title content in the head tag is included in the plain text, as the output for the head node shows. The other is that the line breaks between elements also end up in the plain text, as the output for the html node shows. You may also have noticed that the results of toHtml(), toHtml(true), and toHtml(false) are no different. If you trace the HTMLParser source code, you find that Node is implemented by AbstractNode, whose toHtml() simply calls toHtml(false), and that the three subclasses of AbstractNode (RemarkNode, TagNode, and TextNode) do not process the verbatim parameter in their implementations of toHtml(boolean verbatim). That is why the three functions return identical results. Unless you need to implement special handling of your own, simply use toHtml().
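
The delegation described above can be sketched as follows. This is a simplified model written for illustration, not the HTMLParser source: MiniNode, MiniText, MiniRemark, and MiniTag are hypothetical classes that mimic only the behavior discussed here, namely toHtml() forwarding to toHtml(false), subclasses ignoring verbatim, and toPlainTextString() keeping text while skipping tags and comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical base class; it plays the role the article assigns to AbstractNode. */
abstract class MiniNode {
    // The no-argument form forwards to toHtml(false).
    String toHtml() { return toHtml(false); }
    abstract String toHtml(boolean verbatim);
    abstract String toPlainTextString();
}

class MiniText extends MiniNode {
    final String text;
    MiniText(String text) { this.text = text; }
    String toHtml(boolean verbatim) { return text; }            // verbatim is ignored
    String toPlainTextString() { return text; }
}

class MiniRemark extends MiniNode {
    final String text;
    MiniRemark(String text) { this.text = text; }
    String toHtml(boolean verbatim) { return "<!--" + text + "-->"; } // verbatim is ignored
    String toPlainTextString() { return ""; }                   // comments are not visible text
}

class MiniTag extends MiniNode {
    final String name;
    final List<MiniNode> children = new ArrayList<>();

    MiniTag(String name, MiniNode... kids) {
        this.name = name;
        children.addAll(Arrays.asList(kids));
    }

    String toHtml(boolean verbatim) {
        StringBuilder sb = new StringBuilder("<" + name + ">");
        for (MiniNode child : children) sb.append(child.toHtml(verbatim));
        return sb.append("</").append(name).append(">").toString();
    }

    // Only the text of the descendants survives; the tag markup itself is dropped.
    String toPlainTextString() {
        StringBuilder sb = new StringBuilder();
        for (MiniNode child : children) sb.append(child.toPlainTextString());
        return sb.toString();
    }
}

class ToHtmlSketch {
    public static void main(String[] args) {
        MiniNode div = new MiniTag("div",
            new MiniRemark("note"), new MiniText("hello\n"), new MiniTag("a", new MiniText("link")));
        System.out.println(div.toHtml().equals(div.toHtml(true)));
        System.out.println(div.toPlainTextString());
    }
}
```

Because no subclass in the model looks at verbatim, all three calls necessarily return the same string, which matches what the test output shows for the real classes.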
Node class inheritance (this part is copied from another article): AbstractNode directly implements Node and is itself an abstract class. It has three direct subclasses:
RemarkNode holds comments. In the toString part of the output you can see "Rem (340[6,2],351[6,13]): This is a comment", which is a RemarkNode.
TextNode is also simple: it is the text that is visible to the user.
TagNode is the most complex. It covers all the tags in HTML and can be extended, which lets HTMLParser handle custom tags. Tags fall into two categories. One is the simple Tag, which cannot contain other tags and can only be a leaf node. The other is CompositeTag, which can contain other tags and forms a branch node.
