HTMLParser usage (4): accessing content through a visitor

Last Update:2018-12-07 Source: Internet

Author: User


HTMLParser traverses the content of a web page and stores the result in a tree (forest) structure. There are two ways to access that result: a filter or a visitor.
The following describes how to use a visitor to access the content.

4.1 NodeVisitor
Put simply (and somewhat one-sidedly), a filter selects the nodes that match certain conditions so that you can then process them, while a visitor traverses every node in the content tree and processes the ones that meet your conditions as it goes. The end result is the same; they are two different routes to it.
The following is a typical NodeVisitor example.
Test code:
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Parser;
import org.htmlparser.Remark;
import org.htmlparser.Tag;
import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

// message(...) is assumed to be a small print helper defined elsewhere in the test class.
public static void main(String[] args) {
    try {
        // Parse the test page served from the local machine.
        Parser parser = new Parser((HttpURLConnection) (new URL("http://127.0.0.1:8080/htmlparsertester.html")).openConnection());

        // Arguments are (recurseChildren, recurseSelf).
        NodeVisitor visitor = new NodeVisitor(false, false) {
            public void visitTag(Tag tag) {
                message("This is Tag:" + tag.getText());
            }
            public void visitStringNode(Text string) {
                message("This is Text:" + string);
            }
            public void visitRemarkNode(Remark remark) {
                message("This is Remark:" + remark.getText());
            }
            public void beginParsing() {
                message("beginParsing");
            }
            public void visitEndTag(Tag tag) {
                message("visitEndTag:" + tag.getText());
            }
            public void finishedParsing() {
                message("finishedParsing");
            }
        };

        parser.visitAllNodesWith(visitor);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
Output result:
beginParsing
This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
This is Text:Txt (121 [0,121], 123 []): \n
This is Text:Txt (244 [1,121], 246 []): \n
finishedParsing

As you can see, beginParsing is called first, before any node is visited; the nodes are then processed one by one; and finishedParsing is called when the traversal ends. Because I set both recurseChildren and recurseSelf to false, the visitor neither descends into child nodes nor reports the root nodes that contain them (head and html do not appear). The two \n outputs in the middle are the two top-level line breaks discussed in HTMLParser usage (1), on initializing the Parser.

Now set recurseSelf to true and see what happens.
NodeVisitor visitor = new NodeVisitor(false, true) {
Output result:
beginParsing
This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
This is Text:Txt (121 [0,121], 123 []): \n
This is Tag:head
This is Text:Txt (244 [1,121], 246 []): \n
This is Tag:html xmlns="http://www.w3.org/1999/xhtml"
finishedParsing
This time every node in the first (top) layer of the HTML page is visited, including head and html, but nothing below them.

Next, swap the two flags:
NodeVisitor visitor = new NodeVisitor(true, false) {
Output result:
beginParsing
This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
This is Text:Txt (121 [0,121], 123 []): \n
This is Tag:meta http-equiv="Content-Type" content="text/html; charset=gb2312"
This is Text:Txt (204 [229], 1,106 []): baizeju.com
visitEndTag:/title
visitEndTag:/head
This is Text:Txt (244 [1,121], 246 []): \n
This is Text:Txt (289 [291], []): \n
This is Text:Txt (298 [300], []): \n
This is Text:Txt (319 [322], []): \n\t
This is Text:Txt (342 [5, 21], 346 [6, 2]): \n\t
This is Remark:This is a comment on Bai zeju-www.baizeju.com
This is Text:Txt (378 [6, 34], 408 [8, 0]): \n\tBai zeju-string 1-www.baizeju.com\n
This is Text:Txt (441 [465], []): baizeju.com
visitEndTag:/
This is Text:Txt (469 [472], []): \n\t
visitEndTag:/div
This is Text:Txt (478 [507], []): \n\tBai zeju-string 2-www.baizeju.com\n
visitEndTag:/div
This is Text:Txt (513 [515], []): \n
visitEndTag:/body
This is Text:Txt (522 [524], []): \n
visitEndTag:/html
finishedParsing
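Now the child nodes all appear, but the two top-level tags from the previous example, "This is Tag:head" and "This is Tag:html xmlns="http://www.w3.org/1999/xhtml"", are gone. In fact no tag that contains children (title, body, div, and so on) is reported itself in this run; only its children and its end tag are.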

To get everything, you only need:
NodeVisitor visitor = new NodeVisitor(true, true) {
Output result:
beginParsing
This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
This is Text:Txt (121 [0,121], 123 []): \n
This is Tag:head
This is Tag:meta http-equiv="Content-Type" content="text/html; charset=gb2312"
This is Tag:title
This is Text:Txt (204 [229], 1,106 []): baizeju.com
visitEndTag:/title
visitEndTag:/head
This is Text:Txt (244 [1,121], 246 []): \n
This is Tag:html xmlns="http://www.w3.org/1999/xhtml"
This is Text:Txt (289 [291], []): \n
This is Tag:body
This is Text:Txt (298 [300], []): \n
This is Tag:div id="top_main"
This is Text:Txt (319 [322], []): \n\t
This is Tag:div id="logoindex"
This is Text:Txt (342 [5, 21], 346 [6, 2]): \n\t
This is Remark:This is a comment on Bai zeju-www.baizeju.com
This is Text:Txt (378 [6, 34], 408 [8, 0]): \n\tBai zeju-string 1-www.baizeju.com\n
This is Tag:a href="http://www.baizeju.com"
This is Text:Txt (441 [465], []): baizeju.com
visitEndTag:/
This is Text:Txt (469 [472], []): \n\t
visitEndTag:/div
This is Text:Txt (478 [507], []): \n\tBai zeju-string 2-www.baizeju.com\n
visitEndTag:/div
This is Text:Txt (513 [515], []): \n
visitEndTag:/body
This is Text:Txt (522 [524], []): \n
visitEndTag:/html
finishedParsing
Now the order of the calls is clear. You can put your own code in whichever callback you need to handle.
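To make the effect of the two flags easier to reason about, here is a small self-contained model of the traversal. It is only a sketch and does not use HTMLParser at all: MiniNode, Kind, accept and trace are hypothetical names invented for illustration, and the rule it encodes (only tags that contain children consult recurseSelf and recurseChildren, while text nodes and simple tags such as the DOCTYPE are always reported) is my reading of the four outputs above, not a copy of the library's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the traversal; NOT the real org.htmlparser classes.
class VisitorFlagsDemo {

    enum Kind { TEXT, SIMPLE_TAG, COMPOSITE_TAG }

    /** Hypothetical miniature node: text, a tag without children, or a tag with children. */
    static final class MiniNode {
        final Kind kind;
        final String label;
        final List<MiniNode> children;

        MiniNode(Kind kind, String label, MiniNode... kids) {
            this.kind = kind;
            this.label = label;
            this.children = Arrays.asList(kids);
        }
    }

    /** Visits one node; only composite tags look at the two flags (assumed rule). */
    static void accept(MiniNode node, boolean recurseChildren, boolean recurseSelf, List<String> out) {
        switch (node.kind) {
            case TEXT:
                out.add("text:" + node.label);
                break;
            case SIMPLE_TAG:
                out.add("tag:" + node.label);
                break;
            case COMPOSITE_TAG:
                if (recurseSelf) {
                    out.add("tag:" + node.label);
                }
                if (recurseChildren) {
                    for (MiniNode child : node.children) {
                        accept(child, recurseChildren, recurseSelf, out);
                    }
                    out.add("endtag:/" + node.label);
                }
                break;
        }
    }

    /** Walks a made-up page: DOCTYPE, then head containing meta and title. */
    static List<String> trace(boolean recurseChildren, boolean recurseSelf) {
        List<MiniNode> topLevel = Arrays.asList(
            new MiniNode(Kind.SIMPLE_TAG, "!DOCTYPE"),
            new MiniNode(Kind.COMPOSITE_TAG, "head",
                new MiniNode(Kind.SIMPLE_TAG, "meta"),
                new MiniNode(Kind.COMPOSITE_TAG, "title",
                    new MiniNode(Kind.TEXT, "t"))));
        List<String> out = new ArrayList<>();
        out.add("beginParsing");
        for (MiniNode node : topLevel) {
            accept(node, recurseChildren, recurseSelf, out);
        }
        out.add("finishedParsing");
        return out;
    }

    public static void main(String[] args) {
        boolean[][] combos = { {false, false}, {false, true}, {true, false}, {true, true} };
        for (boolean[] c : combos) {
            System.out.println("(" + c[0] + ", " + c[1] + ") -> " + trace(c[0], c[1]));
        }
    }
}
```

Running main prints one line per flag combination. With (true, false) the model reports meta, the title text and the two end tags but never head or title themselves, which is the same pattern as the third run above.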

4.2 Other visitors
HTMLParser also defines several other visitors. Apart from NodeVisitor itself, HtmlPage, ObjectFindingVisitor, StringFindingVisitor, TagFindingVisitor, TextExtractingVisitor, and UrlModifyingVisitor are all subclasses of NodeVisitor that implement specific functions. Personally I do not find them very useful. If you need a specific function, it is usually better to write it yourself; hunting for it among these classes may well take longer. If you read their source, you will find that only a few lines of each really do the work.
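As a rough illustration of how little code a purpose-built visitor needs, the sketch below collects the non-blank text from a stream of callbacks. It is self-contained and deliberately avoids the real library: MiniVisitor is a hypothetical stand-in for NodeVisitor with no-op callbacks, TextCollector is the hand-written visitor, and replay feeds it a made-up event sequence in place of a real parser. With HTMLParser itself the same idea would presumably be a NodeVisitor subclass that overrides only visitStringNode.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; NOT the real org.htmlparser classes.
class OwnVisitorSketch {

    /** Hypothetical stand-in for NodeVisitor: every callback does nothing by default. */
    static abstract class MiniVisitor {
        void visitTag(String tagText) { }
        void visitStringNode(String text) { }
        void visitEndTag(String tagText) { }
    }

    /** The few lines that do the work: keep non-blank text, ignore everything else. */
    static class TextCollector extends MiniVisitor {
        final List<String> texts = new ArrayList<>();

        @Override
        void visitStringNode(String text) {
            String trimmed = text.trim();
            if (!trimmed.isEmpty()) {
                texts.add(trimmed);
            }
        }
    }

    /** Replays a fixed, made-up callback sequence in place of a real parser. */
    static void replay(MiniVisitor visitor) {
        visitor.visitTag("div id=\"top_main\"");
        visitor.visitStringNode("\n\t");
        visitor.visitTag("a href=\"http://www.baizeju.com\"");
        visitor.visitStringNode("baizeju.com");
        visitor.visitEndTag("/a");
        visitor.visitStringNode("\n\tstring 2\n");
        visitor.visitEndTag("/div");
    }

    /** Runs the collector over the replayed events and returns what it kept. */
    static List<String> collect() {
        TextCollector collector = new TextCollector();
        replay(collector);
        return collector.texts;
    }

    public static void main(String[] args) {
        System.out.println(collect());
    }
}
```

Only visitStringNode is overridden; the tag and end-tag callbacks fall through to the empty defaults, so whitespace-only text nodes and all tags are skipped.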

This article is an English version of an article originally published in Chinese on aliyun.com.
