HtmlParser is an excellent tool for extracting information from web pages. The following describes some basic usage:
1. There are two ways to create a Parser object.
The first is to construct it directly from HTML that has already been loaded into a string:
Parser parser = new Parser(html);
The second is to construct it from the URLConnection of the page:
// Create a Parser object from the URLConnection of the page
Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
The parsed content can then be accessed through the parser object.
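Put together as a self-contained sketch, with the imports the two constructors need (the class name CreateParserDemo, the literal HTML string, and the example URL are placeholders, not part of the library):

import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Parser;

public class CreateParserDemo {
    public static void main(String[] args) throws Exception {
        // Way 1: construct the parser from HTML that has already been loaded into a string
        String html = "<html><body><div id=\"main\">hello</div></body></html>";
        Parser fromHtml = new Parser(html);

        // Way 2: construct the parser from the URLConnection of the page (example URL)
        String url = "http://news.qq.com/";
        Parser fromUrl = new Parser((HttpURLConnection) new URL(url).openConnection());
        // The encoding is usually set to match the page before parsing starts
        fromUrl.setEncoding("gb2312");
    }
}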
2. There are two parsing approaches: Visitor and Filter. The Visitor approach traverses every node and lets a visitor react to each one, while the Filter approach returns only the nodes that match a given filter.
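As a minimal side-by-side sketch of the two approaches (the literal HTML string, the UTF-8 charset, and the class name VisitorVsFilter are illustrative assumptions; full examples follow in sections 3 and 5):

import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;

public class VisitorVsFilter {
    public static void main(String[] args) throws Exception {
        // A small literal HTML string stands in for a real page in this sketch
        String html = "<html><body><div>first</div><div>second</div></body></html>";

        // Visitor approach: the parser walks every node and calls back into the visitor
        Parser parser = Parser.createParser(html, "UTF-8");
        TextExtractingVisitor visitor = new TextExtractingVisitor();
        parser.visitAllNodesWith(visitor);
        System.out.println("Visitor result: " + visitor.getExtractedText());

        // Filter approach: the parser returns only the nodes that match the filter
        parser = Parser.createParser(html, "UTF-8");
        NodeList divs = parser.extractAllNodesThatMatch(new TagNameFilter("div"));
        System.out.println("Filter result: " + divs.size() + " div node(s)");
    }
}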
3. Visitor examples
try {
    // Create a Parser object from the URLConnection of the page
    Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
    // Set the parser's character encoding; it should generally match the page's encoding
    parser.setEncoding("gb2312");
    // Create a LinkFindingVisitor object
    LinkFindingVisitor lVisitor = new LinkFindingVisitor("http://news.qq.com/");
    // Count the links on the page that contain http://news.qq.com/
    parser.visitAllNodesWith(lVisitor);
    System.out.println("Number of links on the page containing http://news.qq.com/: " + lVisitor.getCount());
} catch (Exception ex) {
    ex.printStackTrace();
}
/** Example of TextExtractingVisitor usage */
public static void testTextExtractingVisitor(String url) {
    try {
        // Create a Parser object from the URLConnection of the page
        Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
        // Set the parser's character encoding; it should generally match the page's encoding
        parser.setEncoding("gb2312");
        // Create a TextExtractingVisitor object
        TextExtractingVisitor visitor = new TextExtractingVisitor();
        // Strip all tags from the page and extract the plain text content
        parser.visitAllNodesWith(visitor);
        System.out.println("Plain text content of the page: " + visitor.getExtractedText());
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
4. You can also extend NodeVisitor with your own subclass and override the methods it defines:
/** Custom NodeVisitor subclass that overrides the relevant methods of the abstract class NodeVisitor */
public class MyNodeVisitor extends NodeVisitor {
    /** Override beginParsing(); it is called when parsing starts */
    public void beginParsing() {
        System.out.println("Starting to parse the HTML content...");
    }
    /** Override finishedParsing(); it is called when parsing finishes */
    public void finishedParsing() {
        System.out.println("The entire HTML content has been parsed!");
    }
    /** Override visitTag(); it is called when a start tag is encountered */
    public void visitTag(Tag tag) {
        System.out.println("Start tag: " + tag.getText());
    }
    /** Override visitEndTag(); it is called when an end tag is encountered */
    public void visitEndTag(Tag tag) {
        System.out.println("End tag: " + tag.getText());
    }
    /** Override visitStringNode(); it is called when a text node is encountered */
    public void visitStringNode(Text string) {
        System.out.println("Current text node: " + string);
    }
    /** Override visitRemarkNode(); it is called when a comment node is encountered */
    public void visitRemarkNode(Remark remark) {
        System.out.println("Current comment: " + remark);
    }
}
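To actually run the custom visitor, it is handed to a Parser just like the built-in visitors above; a minimal driver might look like this (the class name MyNodeVisitorDemo and the example URL are assumptions):

import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Parser;

public class MyNodeVisitorDemo {
    public static void main(String[] args) {
        try {
            // Create a Parser object from the URLConnection of the page (example URL)
            String url = "http://news.qq.com/";
            Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
            // Match the page's character encoding
            parser.setEncoding("gb2312");
            // Walk every node with the custom visitor; its overridden methods are called back
            parser.visitAllNodesWith(new MyNodeVisitor());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}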
5. Filter examples
Basic usage:
TagNameFilter usage:
// Create a Parser object from the URLConnection of the page
Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
// Set the parser's character encoding; it should generally match the page's encoding
parser.setEncoding("gb2312");
// Create a TagNameFilter instance
NodeFilter filter = new TagNameFilter("div");
// Filter out all div tag nodes
NodeList nodes = parser.extractAllNodesThatMatch(filter);
if (nodes != null) {
    for (int i = 0; i < nodes.size(); i++) {
        Node textNode = nodes.elementAt(i);
        System.out.println("Current div: " + textNode.getText());
    }
}
AndFilter usage:
// Create a Parser object from the URLConnection of the page
Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
// Set the parser's character encoding; it should generally match the page's encoding
parser.setEncoding("gb2312");
// Create a HasAttributeFilter instance
NodeFilter filter1 = new HasAttributeFilter("id");
// Create a TagNameFilter instance
NodeFilter innerFilter = new TagNameFilter("div");
// Create a HasChildFilter instance
NodeFilter filter2 = new HasChildFilter(innerFilter);
// Create an AndFilter instance
NodeFilter filter = new AndFilter(filter1, filter2);
// Filter all nodes that have an id attribute and contain a div child node
NodeList nodes = parser.extractAllNodesThatMatch(filter);
if (nodes != null) {
    for (int i = 0; i < nodes.size(); i++) {
        Node textNode = nodes.elementAt(i);
        System.out.println("Current div: " + textNode.getText());
    }
}
StringFilter usage:
// Create a Parser object from the URLConnection of the page
Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
// Set the parser's character encoding; it should generally match the page's encoding
parser.setEncoding("gb2312");
// Create a StringFilter instance
NodeFilter filter = new StringFilter("Chen Shui-bian");
// Filter all text nodes that contain the string "Chen Shui-bian"
NodeList nodes = parser.extractAllNodesThatMatch(filter);
if (nodes != null) {
    for (int i = 0; i < nodes.size(); i++) {
        Node textNode = nodes.elementAt(i);
        System.out.println("Text node containing the string \"Chen Shui-bian\": " + textNode.getText());
    }
}
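For completeness, here is a sketch of how one of these filter snippets might be packaged as a runnable method, with the imports it needs (the class FilterDemo, the method testStringFilter, the keyword, and the sample URL are all illustrative):

import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.StringFilter;
import org.htmlparser.util.NodeList;

public class FilterDemo {
    /** Hypothetical wrapper around the StringFilter snippet above */
    public static void testStringFilter(String url, String keyword) {
        try {
            // Create a Parser object from the URLConnection of the page
            Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
            // Match the page's character encoding
            parser.setEncoding("gb2312");
            // Collect all text nodes containing the keyword
            NodeList nodes = parser.extractAllNodesThatMatch(new StringFilter(keyword));
            if (nodes != null) {
                for (int i = 0; i < nodes.size(); i++) {
                    Node textNode = nodes.elementAt(i);
                    System.out.println("Text node containing \"" + keyword + "\": " + textNode.getText());
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // Example call with an assumed URL and keyword
        testStringFilter("http://news.qq.com/", "news");
    }
}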