HTMLParser stores the parsed information as a tree structure, and Node is the basic unit for storing that information.
Here is the definition of Node:
public interface Node extends Cloneable;
Node declares the following methods:
Functions for traversing the tree structure, which are the easiest to understand:
Node getParent(): gets the parent node
NodeList getChildren(): gets the list of child nodes
Node getFirstChild(): gets the first child node
Node getLastChild(): gets the last child node
Node getPreviousSibling(): gets the previous sibling node
Node getNextSibling(): gets the next sibling node
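As a rough illustration of how these traversal functions fit together, the sketch below walks a small tree depth-first using only first-child and next-sibling links. SketchNode and TraversalSketch are hypothetical classes written for this example, not part of HTMLParser; only the method names getParent, getFirstChild, and getNextSibling mirror the interface above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed node; this is NOT HTMLParser's Node interface.
class SketchNode {
    private final String name;
    private SketchNode parent;
    private final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name, SketchNode... kids) {
        this.name = name;
        for (SketchNode kid : kids) {
            kid.parent = this;
            children.add(kid);
        }
    }

    String getName() { return name; }
    SketchNode getParent() { return parent; }
    SketchNode getFirstChild() { return children.isEmpty() ? null : children.get(0); }

    SketchNode getNextSibling() {
        if (parent == null) return null;              // a root has no siblings here
        int next = parent.children.indexOf(this) + 1;
        return next < parent.children.size() ? parent.children.get(next) : null;
    }
}

class TraversalSketch {
    // Depth-first walk that uses only the first-child and next-sibling links.
    static void walk(SketchNode start, int depth, List<String> out) {
        for (SketchNode n = start; n != null; n = n.getNextSibling()) {
            out.add(depth + ":" + n.getName());
            walk(n.getFirstChild(), depth + 1, out);
        }
    }

    // Small tree shaped like a typical page: html -> (head -> title), (body -> div).
    static SketchNode sampleTree() {
        return new SketchNode("html",
                new SketchNode("head", new SketchNode("title")),
                new SketchNode("body", new SketchNode("div")));
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        walk(sampleTree(), 0, out);
        System.out.println(out);
    }
}
```

Each entry in the result is prefixed with its depth, so the nesting of the tree can be read directly from the list.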
Functions for obtaining node content:
String getText(): gets the text
String toPlainTextString(): gets the plain text information
String toHtml(): gets the HTML information (the original HTML)
String toHtml(boolean verbatim): gets the HTML information (the original HTML)
String toString(): gets a string representation of the node
Page getPage(): gets the Page object corresponding to this node
int getStartPosition(): gets the start position of the node in the HTML page
int getEndPosition(): gets the end position of the node in the HTML page
Filter functions:
void collectInto(NodeList list, NodeFilter filter): tests this node against the filter condition and places the nodes that satisfy it into list.
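The collect-into-a-list idea can be sketched without the library. In the hypothetical FilterNode class below, a java.util.List stands in for NodeList and a Predicate stands in for NodeFilter. The recursion into child nodes is an assumption of this sketch, not something the description above states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical minimal node. The real method takes a NodeList and a NodeFilter;
// a List and a Predicate stand in for them here.
class FilterNode {
    final String tagName;
    final List<FilterNode> children = new ArrayList<>();

    FilterNode(String tagName, FilterNode... kids) {
        this.tagName = tagName;
        for (FilterNode kid : kids) children.add(kid);
    }

    // Test this node first, then descend so matching descendants are collected too
    // (the descent is an assumption made for this sketch).
    void collectInto(List<FilterNode> list, Predicate<FilterNode> filter) {
        if (filter.test(this)) list.add(this);
        for (FilterNode child : children) child.collectInto(list, filter);
    }
}

class CollectSketch {
    // Collect every node in the tree whose tag name equals the given name.
    static List<FilterNode> findByName(FilterNode root, String name) {
        List<FilterNode> found = new ArrayList<>();
        root.collectInto(found, n -> n.tagName.equals(name));
        return found;
    }

    static FilterNode sampleTree() {
        return new FilterNode("body",
                new FilterNode("div", new FilterNode("div"), new FilterNode("a")));
    }

    public static void main(String[] args) {
        System.out.println(findByName(sampleTree(), "div").size() + " div node(s) collected");
    }
}
```

The caller owns the result list, so several calls with different filters can append to the same list if that is convenient.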
Functions used for visitor traversal:
void accept(NodeVisitor visitor): applies the visitor to this node.
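For readers unfamiliar with the visitor pattern, here is a minimal sketch of what an accept method usually does. All names in it (SketchVisitor, VisitableNode, VisitableTag, VisitableText) are hypothetical, and passing the visitor on to child nodes is an assumption of the sketch. The library's NodeVisitor has its own callbacks, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical visitor with one callback per node kind.
interface SketchVisitor {
    void visitTag(String name);
    void visitText(String text);
}

// Hypothetical node types; each one knows which visitor callback to invoke.
abstract class VisitableNode {
    abstract void accept(SketchVisitor visitor);
}

class VisitableText extends VisitableNode {
    private final String text;
    VisitableText(String text) { this.text = text; }
    @Override void accept(SketchVisitor visitor) { visitor.visitText(text); }
}

class VisitableTag extends VisitableNode {
    private final String name;
    private final List<VisitableNode> children = new ArrayList<>();

    VisitableTag(String name, VisitableNode... kids) {
        this.name = name;
        for (VisitableNode kid : kids) children.add(kid);
    }

    // Report this tag, then pass the same visitor on to every child.
    @Override void accept(SketchVisitor visitor) {
        visitor.visitTag(name);
        for (VisitableNode child : children) child.accept(visitor);
    }
}

class VisitorSketch {
    // Uses a visitor that gathers all text it is shown, in document order.
    static String collectText(VisitableNode root) {
        StringBuilder sb = new StringBuilder();
        root.accept(new SketchVisitor() {
            @Override public void visitTag(String name) { /* tags are ignored here */ }
            @Override public void visitText(String text) { sb.append(text); }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        VisitableNode root = new VisitableTag("div",
                new VisitableText("Hello, "),
                new VisitableTag("a", new VisitableText("world")));
        System.out.println(collectText(root));
    }
}
```

The traversal logic lives in the nodes, while the visitor only decides what to do with each node it is shown.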
Functions for modifying content, which are rarely used:
void setPage(Page page): sets the Page object corresponding to this node
void setText(String text): sets the text
void setChildren(NodeList children): sets the list of child nodes
Other functions:
void doSemanticAction(): performs the operation associated with this node (only a few tags have one)
Object clone(): creates a copy of the node (Node extends Cloneable)
HTMLParser is most often used to process HTML pages, which requires the filter- or visitor-related functions. Apart from those, the first and second groups of functions are used the most. The first group is easy to understand, so the following example illustrates the second group.
The following HTML file is used for testing:
Bai zeju -www.baizeju.com
Bai ze ju -www.baizeju.com
Test code:
/**
 * @author www.baizeju.com
 */
package com.baizeju.htmlparsertester;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Node;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.Parser;

/**
 * @author www.baizeju.com
 */
public class Main {
    private static String ENCODE = "GBK";

    // Print a message, converting it from ENCODE to the platform's default encoding.
    private static void message(String szMsg) {
        try { System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding"))); } catch (Exception e) {}
    }

    // Read a whole file into a string; returns "" if anything goes wrong.
    public static String openFile(String szFileName) {
        try {
            BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream(new File(szFileName)), ENCODE));
            String szContent = "";
            String szTemp;
            while ((szTemp = bis.readLine()) != null) {
                szContent += szTemp + "\n";
            }
            bis.close();
            return szContent;
        }
        catch (Exception e) {
            return "";
        }
    }

    public static void main(String[] args) {
        try {
            Parser parser = new Parser((HttpURLConnection) (new URL("http://127.0.0.1:8080/htmlparsertester.html")).openConnection());
            // Iterate over the top-level nodes and print each content function's result.
            for (NodeIterator i = parser.elements(); i.hasMoreNodes(); ) {
                Node node = i.nextNode();
                message("getText:" + node.getText());
                message("getPlainText:" + node.toPlainTextString());
                message("toHtml:" + node.toHtml());
                message("toHtml(true):" + node.toHtml(true));
                message("toHtml(false):" + node.toHtml(false));
                message("toString:" + node.toString());
                message("=================================================");
            }
        }
        catch (Exception e) {
            System.out.println("Exception:" + e);
        }
    }
}
Output:
getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
getPlainText:
toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at: 0; ends at: 121
=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (121 [0,121], 123 []): \n
=================================================
getText:head
getPlainText:Bai zeju -www.baizeju.com
toHtml: Bai zeju -www.baizeju.com
toHtml(true): Bai zeju -www.baizeju.com
toHtml(false): Bai zeju -www.baizeju.com
toString:Head: Tag (123 [129], []): head
Tag (129 [197], []): meta http-equiv="Content-Type" content="text/html;...
Tag (197 [204], 204 [223]): title
Txt (1,100 [], []): Bai zeju -www.baizeju.com
End (223 [1,100], 231 [1,108]): /title
End (231 [1,108], 238 [1,115]): /head
=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (238 [1,115], 240 []): \n
=================================================
getText:html xmlns="http://www.w3.org/1999/xhtml"
getPlainText:
Bai zeju -www.baizeju.com
Bai zeju -www.baizeju.com
Bai zeju -www.baizeju.com
toHtml:<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div id="top_main">
<div id="logoindex">
<!-- This is a comment -->
Bai zeju -www.baizeju.com
<a href="http://www.baizeju.com">Bai zeju -www.baizeju.com</a>
</div>
Bai zeju -www.baizeju.com
</div>
</body>
</html>
toHtml(true):<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div id="top_main">
<div id="logoindex">
<!-- This is a comment -->
Bai zeju -www.baizeju.com
<a href="http://www.baizeju.com">Bai zeju -www.baizeju.com</a>
</div>
Bai zeju -www.baizeju.com
</div>
</body>
</html>
toHtml(false):<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div id="top_main">
<div id="logoindex">
<!-- This is a comment -->
Bai zeju -www.baizeju.com
<a href="http://www.baizeju.com">Bai zeju -www.baizeju.com</a>
</div>
Bai zeju -www.baizeju.com
</div>
</body>
</html>
toString:Tag (240 [283], []): html xmlns="http://www.w3.org/1999/xhtml"
Txt (283 [285], []): \n
Tag (285 [292], []): body
Txt (292 [294], []): \n
Tag (294 [313], []): div id="top_main"
Txt (313 [316], []): \n\t
Tag (316 [336], []): div id="logoindex"
Txt (336 [340], []): \n\t
Rem (340 [351], []): This is a comment
Txt (351 [6, 13], 376 [8, 0]): \n\t Bai zeju -www.baizeju.com \n
Tag (376 [409], []): a href="http://www.baizeju.com"
Txt (409 [428], []): Bai zeju -www.baizeju.com
End (428 [432], []): /a
Txt (432 [435], []): \n\t
End (435 [441], []): /div
Txt (441 [465], []): \n\t Bai zeju -www.baizeju.com \n
End (465 [471], []): /div
Txt (471 [473], []): \n
End (473 [480], []): /body
Txt (480 [482], []): \n
End (482 [489], []): /html
=================================================
The first node corresponds to the first line of the page, <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, which is easy to understand.
The tree structure of the content can also be seen from the output, or rather a forest structure: the top-level items of the page, such as DOCTYPE, head, and html, each form a node at the highest level. (Many people may find the content of the second and fourth nodes puzzling. They are in fact two line breaks. HTMLParser turns the line breaks, spaces, and tabs in the HTML page content into text nodes of their own, which is why these nodes appear. Little content but a high rank, haha.)
toPlainTextString returns all the content that a user can see. One interesting point is that the title text inside the head node is included in the plain text, as the getPlainText line for the head node shows.
In addition, you may notice that the results of toHtml(), toHtml(true), and toHtml(false) are no different. That is indeed the case. If you trace the HTMLParser code, you will find that the direct subclass of Node is AbstractNode, which implements toHtml() by directly calling toHtml(false). The implementations of toHtml(boolean verbatim) in the three subclasses of AbstractNode (RemarkNode, TagNode, and TextNode) do not process the verbatim parameter, so the three functions return identical results. If you do not need any special processing, you can simply use toHtml().
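The delegation described here can be pictured with a few lines of Java. SketchAbstractNode and SketchTextNode are hypothetical classes written for illustration. They only mimic the call pattern reported above and are not copied from the HTMLParser source.

```java
// Hypothetical classes illustrating the delegation; not the library's source code.
abstract class SketchAbstractNode {
    // The no-argument form simply forwards to the boolean form.
    String toHtml() { return toHtml(false); }

    abstract String toHtml(boolean verbatim);
}

class SketchTextNode extends SketchAbstractNode {
    private final String text;

    SketchTextNode(String text) { this.text = text; }

    @Override
    String toHtml(boolean verbatim) {
        // verbatim is never consulted, so true and false give the same result
        return text;
    }
}

class ToHtmlSketch {
    public static void main(String[] args) {
        SketchAbstractNode node = new SketchTextNode("Bai zeju -www.baizeju.com");
        System.out.println(node.toHtml().equals(node.toHtml(true)));
        System.out.println(node.toHtml(true).equals(node.toHtml(false)));
    }
}
```

Because the parameter is ignored in this sketch, all three calls necessarily return the same string, which is the behaviour seen in the test output.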
The inheritance hierarchy of HTMLParser's Node classes is as follows (copied from another article):
AbstractNode is a direct subclass of Node and is itself an abstract class. It has three direct subclasses. RemarkNode stores comments. In the toString part of the output you can see a line such as "Rem (345[6,2],356[6,13]): This is a comment", which is a RemarkNode. TextNode is also very simple: it is the text visible to the user. TagNode is the most complex. It covers all the tags of the HTML language and can be extended (this is HTMLParser's ability to handle custom tags). TagNodes fall into two kinds. One kind is the simple tag, which cannot contain other tags and can only be a leaf node. The other kind is CompositeTag, which can contain other tags and is a branch node.
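To make the leaf and branch distinction concrete, here is a heavily simplified sketch of such a hierarchy. Every class in it carries a "Demo" prefix to mark it as hypothetical. The real classes have far more members, and the isLeaf and describe methods are inventions of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified hierarchy; the "Demo" prefix marks these
// classes as illustration only, not the library's classes.
abstract class DemoAbstractNode {
    boolean isLeaf() { return true; }            // by default a node holds no children
    abstract String describe();
}

class DemoRemarkNode extends DemoAbstractNode {
    private final String comment;
    DemoRemarkNode(String comment) { this.comment = comment; }
    @Override String describe() { return "Rem: " + comment; }
}

class DemoTextNode extends DemoAbstractNode {
    private final String text;
    DemoTextNode(String text) { this.text = text; }
    @Override String describe() { return "Txt: " + text; }
}

// A simple tag: a leaf that cannot contain other nodes.
class DemoTagNode extends DemoAbstractNode {
    protected final String name;
    DemoTagNode(String name) { this.name = name; }
    @Override String describe() { return "Tag: " + name; }
}

// A composite tag: a branch node that can contain other nodes.
class DemoCompositeTag extends DemoTagNode {
    private final List<DemoAbstractNode> children = new ArrayList<>();

    DemoCompositeTag(String name, DemoAbstractNode... kids) {
        super(name);
        for (DemoAbstractNode kid : kids) children.add(kid);
    }

    @Override boolean isLeaf() { return false; }
    int childCount() { return children.size(); }
}

class HierarchySketch {
    public static void main(String[] args) {
        DemoCompositeTag div = new DemoCompositeTag("div",
                new DemoRemarkNode("This is a comment"),
                new DemoTextNode("Bai zeju -www.baizeju.com"),
                new DemoTagNode("br"));
        System.out.println(div.describe() + ", leaf=" + div.isLeaf() + ", children=" + div.childCount());
    }
}
```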