HTMLParser traverses the content of a web page and stores the result in a tree (forest) structure. There are two ways to access that result: filters and visitors.
This part describes how to use a visitor to access the content.
4.1 NodeVisitor
Put simply, a filter selects the nodes that match certain conditions so that they can then be processed, whereas a visitor traverses every node in the content tree and processes the nodes that meet the conditions. The end result is the same; they are two different routes to the same outcome.
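To make the difference concrete, here is a minimal sketch that does not use HTMLParser at all. The Node class, the extract and visit methods, and the sample tree are hypothetical stand-ins invented for illustration; the only point is that selecting nodes first (filter style) and walking every node with a callback (visitor style) can end up with the same nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Toy model only: none of these types belong to HTMLParser. */
final class FilterVsVisitor {

    /** Hypothetical tree node with a tag name, an id and child nodes. */
    static final class Node {
        final String name;
        final String id;
        final List<Node> children;

        Node(String name, String id, Node... children) {
            this.name = name;
            this.id = id;
            this.children = Arrays.asList(children);
        }
    }

    /** Filter style: collect every matching node first, process the list afterwards. */
    static List<Node> extract(Node root, Predicate<Node> filter) {
        List<Node> hits = new ArrayList<>();
        collect(root, filter, hits);
        return hits;
    }

    private static void collect(Node node, Predicate<Node> filter, List<Node> hits) {
        if (filter.test(node)) {
            hits.add(node);
        }
        for (Node child : node.children) {
            collect(child, filter, hits);
        }
    }

    /** Visitor style: call back on every node; the callback decides what to handle. */
    static void visit(Node node, Consumer<Node> visitor) {
        visitor.accept(node);
        for (Node child : node.children) {
            visit(child, visitor);
        }
    }

    /** Small tree loosely shaped like the test page: html > body > div > div > a. */
    static Node sampleTree() {
        return new Node("html", "",
                new Node("body", "",
                        new Node("div", "top_main",
                                new Node("div", "logoindex",
                                        new Node("a", "")))));
    }

    static List<String> divsViaFilter() {
        List<String> ids = new ArrayList<>();
        for (Node hit : extract(sampleTree(), n -> n.name.equals("div"))) {
            ids.add(hit.id);
        }
        return ids;
    }

    static List<String> divsViaVisitor() {
        List<String> ids = new ArrayList<>();
        visit(sampleTree(), n -> {
            if (n.name.equals("div")) {
                ids.add(n.id);
            }
        });
        return ids;
    }

    public static void main(String[] args) {
        System.out.println("filter:  " + divsViaFilter());
        System.out.println("visitor: " + divsViaVisitor());
    }
}
```

Both methods walk the tree in the same order here, so the two lists come out identical. In this toy version the filter style hands back a list to process later, while the visitor style does its work during the walk.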
The following is a typical NodeVisitor example.
Test code:
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Parser;
import org.htmlparser.Remark;
import org.htmlparser.Tag;
import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

// message(...) is the author's output helper; it is not defined in this section.
public static void main(String[] args) {
    try {
        Parser parser = new Parser((HttpURLConnection) (new URL("http://127.0.0.1:8080/htmlparsertester.html")).openConnection());
        // First argument: recurseChildren, second argument: recurseSelf.
        NodeVisitor visitor = new NodeVisitor(false, false) {
            public void visitTag(Tag tag) {
                message("This is Tag:" + tag.getText());
            }
            public void visitStringNode(Text string) {
                message("This is Text:" + string);
            }
            public void visitRemarkNode(Remark remark) {
                message("This is Remark:" + remark.getText());
            }
            public void beginParsing() {
                message("beginParsing");
            }
            public void visitEndTag(Tag tag) {
                message("visitEndTag:" + tag.getText());
            }
            public void finishedParsing() {
                message("finishedParsing");
            }
        };
        parser.visitAllNodesWith(visitor);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
Output result:
beginParsing
This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
This is Text:Txt (121 [0,121], 123 []): \n
This is Text:Txt (244 [1,121], 246 []): \n
finishedParsing
As you can see, beginParsing is called first, before the traversal starts; the nodes are then processed one by one; and finishedParsing is called at the very end of the traversal. Because I set both recurseChildren and recurseSelf to false, the visitor neither descends into child nodes nor reports the top-level tags that contain them. The two \n lines in the middle are the two top-level line breaks discussed in HTMLParser (1) - Initializing the Parser.
Now set recurseSelf to true and see what happens.
NodeVisitor visitor = new NodeVisitor(false, true) {
Output result:
beginParsing
This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
This is Text:Txt (121 [0,121], 123 []): \n
This is Tag:head
This is Text:Txt (244 [1,121], 246 []): \n
This is Tag:html xmlns="http://www.w3.org/1999/xhtml"
finishedParsing
We can see that the first level of the HTML page is now visited.
Next, try the following:
NodeVisitor visitor = new NodeVisitor(true, false) {
Output result:
beginParsing
This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
This is Text:Txt (121 [0,121], 123 []): \n
This is Tag:meta http-equiv="Content-Type" content="text/html; charset=gb2312"
This is Text:Txt (204 [229], 1,106 []): baizeju.com
visitEndTag:/title
visitEndTag:/head
This is Text:Txt (244 [1,121], 246 []): \n
This is Text:Txt (289 [291], []): \n
This is Text:Txt (298 [300], []): \n
This is Text:Txt (319 [322], []): \n\t
This is Text:Txt (342 [5, 21], 346 [6, 2]): \n\t
This is Remark:This is a comment on Baizeju-www.baizeju.com
This is Text:Txt (378 [6, 34], 408 [8, 0]): \n\tBaizeju-string 1-www.baizeju.com\n
This is Text:Txt (441 [465], []): baizeju.com
visitEndTag:/a
This is Text:Txt (469 [472], []): \n\t
visitEndTag:/div
This is Text:Txt (478 [507], []): \n\tBaizeju-string 2-www.baizeju.com\n
visitEndTag:/div
This is Text:Txt (513 [515], []): \n
visitEndTag:/body
This is Text:Txt (522 [524], []): \n
visitEndTag:/html
finishedParsing
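We can see that the child nodes now appear, but the two top-level tags from the previous example, This is Tag:head and This is Tag:html xmlns="http://www.w3.org/1999/xhtml", are missing. In fact, no "This is Tag" line is printed for any tag that has children (title, body, div and a are missing too); only their contents and end tags are reported.
To have everything reported, you only need:
NodeVisitor visitor = new NodeVisitor(true, true) {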
Output result:
beginParsing
This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
This is Text:Txt (121 [0,121], 123 []): \n
This is Tag:head
This is Tag:meta http-equiv="Content-Type" content="text/html; charset=gb2312"
This is Tag:title
This is Text:Txt (204 [229], 1,106 []): baizeju.com
visitEndTag:/title
visitEndTag:/head
This is Text:Txt (244 [1,121], 246 []): \n
This is Tag:html xmlns="http://www.w3.org/1999/xhtml"
This is Text:Txt (289 [291], []): \n
This is Tag:body
This is Text:Txt (298 [300], []): \n
This is Tag:div id="top_main"
This is Text:Txt (319 [322], []): \n\t
This is Tag:div id="logoindex"
This is Text:Txt (342 [5, 21], 346 [6, 2]): \n\t
This is Remark:This is a comment on Baizeju-www.baizeju.com
This is Text:Txt (378 [6, 34], 408 [8, 0]): \n\tBaizeju-string 1-www.baizeju.com\n
This is Tag:a href="http://www.baizeju.com"
This is Text:Txt (441 [465], []): baizeju.com
visitEndTag:/a
This is Text:Txt (469 [472], []): \n\t
visitEndTag:/div
This is Text:Txt (478 [507], []): \n\tBaizeju-string 2-www.baizeju.com\n
visitEndTag:/div
This is Text:Txt (513 [515], []): \n
visitEndTag:/body
This is Text:Txt (522 [524], []): \n
visitEndTag:/html
finishedParsing
Haha, now the calling order is clear. You can add your own code to whichever callback needs to handle a node.
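The four runs above are consistent with the simplified model below. It is only a sketch written to match the printed output, not HTMLParser's source code: the Node, Kind and LoggingVisitor types are hypothetical, and the rule it assumes is that a tag without children is always reported, while a tag with children is reported only when recurseSelf is true and has its children and end tag visited only when recurseChildren is true.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for the traversal; an assumption, not HTMLParser code. */
final class VisitorModel {

    enum Kind { SIMPLE_TAG, COMPOSITE_TAG, TEXT }

    /** Hypothetical node: a tag with or without children, or a text run. */
    static final class Node {
        final Kind kind;
        final String text;
        final List<Node> children;

        Node(Kind kind, String text, Node... children) {
            this.kind = kind;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /** Records each callback in order, like the message(...) calls in the test code. */
    static final class LoggingVisitor {
        final boolean recurseChildren;
        final boolean recurseSelf;
        final List<String> log = new ArrayList<>();

        LoggingVisitor(boolean recurseChildren, boolean recurseSelf) {
            this.recurseChildren = recurseChildren;
            this.recurseSelf = recurseSelf;
        }

        void visitTag(Node n) { log.add("tag:" + n.text); }
        void visitEndTag(Node n) { log.add("end:/" + n.text); }
        void visitStringNode(Node n) { log.add("text:" + n.text); }
    }

    /** Dispatch rule assumed by this model (see the paragraph above the code). */
    static void accept(Node node, LoggingVisitor visitor) {
        switch (node.kind) {
            case TEXT:
                visitor.visitStringNode(node);
                break;
            case SIMPLE_TAG:
                visitor.visitTag(node);
                break;
            case COMPOSITE_TAG:
                if (visitor.recurseSelf) {
                    visitor.visitTag(node);
                }
                if (visitor.recurseChildren) {
                    for (Node child : node.children) {
                        accept(child, visitor);
                    }
                    visitor.visitEndTag(node);
                }
                break;
        }
    }

    /** Walks a small top-level list: a DOCTYPE tag followed by head > (meta, title > text). */
    static List<String> run(boolean recurseChildren, boolean recurseSelf) {
        List<Node> topLevel = Arrays.asList(
                new Node(Kind.SIMPLE_TAG, "!DOCTYPE"),
                new Node(Kind.COMPOSITE_TAG, "head",
                        new Node(Kind.SIMPLE_TAG, "meta"),
                        new Node(Kind.COMPOSITE_TAG, "title",
                                new Node(Kind.TEXT, "baizeju.com"))));
        LoggingVisitor visitor = new LoggingVisitor(recurseChildren, recurseSelf);
        visitor.log.add("beginParsing");
        for (Node node : topLevel) {
            accept(node, visitor);
        }
        visitor.log.add("finishedParsing");
        return visitor.log;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean children : flags) {
            for (boolean self : flags) {
                System.out.println("(" + children + ", " + self + ") " + run(children, self));
            }
        }
    }
}
```

With (true, false) the model logs meta, the title text and the two end tags but no tag line for head or title, which is the same pattern as the third run above.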
4.2 Other visitors
HTMLParser also defines several other visitors. HtmlPage, ObjectFindingVisitor, StringFindingVisitor, TagFindingVisitor, TextExtractingVisitor and UrlModifyingVisitor are all subclasses of NodeVisitor, and each implements one specific function. Personally, I do not find them very useful. If you need a specific function, it is better to write it yourself; hunting through these classes for the one you want may well take longer. If you look at their code, you will find that only a few lines in each actually do the work.
HT