HtmlParser is an excellent tool for extracting information from web pages. The following describes some basic usage:
1. There are two ways to create a Parser object.
The first is to construct it directly from HTML that has already been loaded into a string:
Parser parser = new Parser(html);
The second is to construct it from the URLConnection of the page:
// Create a Parser object from the URLConnection of the page
Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
The parsed content can then be accessed through the parser object.
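Put together as a self-contained sketch, with the imports the two constructors need (the class name CreateParserDemo, the literal HTML string, and the example URL are placeholders, not part of the library):

import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Parser;

public class CreateParserDemo {
    public static void main(String[] args) throws Exception {
        // Way 1: construct the parser from HTML that has already been loaded into a string
        String html = "<html><body><div id=\"main\">hello</div></body></html>";
        Parser fromHtml = new Parser(html);

        // Way 2: construct the parser from the URLConnection of the page (example URL)
        String url = "http://news.qq.com/";
        Parser fromUrl = new Parser((HttpURLConnection) new URL(url).openConnection());
        // The encoding is usually set to match the page before parsing starts
        fromUrl.setEncoding("gb2312");
    }
}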
2. There are two parsing approaches: Visitor and Filter. The Visitor approach traverses every node and lets a visitor react to each one, while the Filter approach returns only the nodes that match a given filter.
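As a minimal side-by-side sketch of the two approaches (the literal HTML string, the UTF-8 charset, and the class name VisitorVsFilter are illustrative assumptions; full examples follow in sections 3 and 5):

import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;

public class VisitorVsFilter {
    public static void main(String[] args) throws Exception {
        // A small literal HTML string stands in for a real page in this sketch
        String html = "<html><body><div>first</div><div>second</div></body></html>";

        // Visitor approach: the parser walks every node and calls back into the visitor
        Parser parser = Parser.createParser(html, "UTF-8");
        TextExtractingVisitor visitor = new TextExtractingVisitor();
        parser.visitAllNodesWith(visitor);
        System.out.println("Visitor result: " + visitor.getExtractedText());

        // Filter approach: the parser returns only the nodes that match the filter
        parser = Parser.createParser(html, "UTF-8");
        NodeList divs = parser.extractAllNodesThatMatch(new TagNameFilter("div"));
        System.out.println("Filter result: " + divs.size() + " div node(s)");
    }
}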
3. Visitor examples
try {
    // Create a Parser object from the URLConnection of the page
    Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
    // Set the parser's character encoding; it should generally match the page's encoding
    parser.setEncoding("gb2312");
    // Create a LinkFindingVisitor object
    LinkFindingVisitor lVisitor = new LinkFindingVisitor("http://news.qq.com/");
    // Count the links on the page that contain http://news.qq.com/
    parser.visitAllNodesWith(lVisitor);
    System.out.println("Number of links on the page containing http://news.qq.com/: " + lVisitor.getCount());
} catch (Exception ex) {
    ex.printStackTrace();
}
/** Example of TextExtractingVisitor usage */
public static void testTextExtractingVisitor(String url) {
    try {
        // Create a Parser object from the URLConnection of the page
        Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
        // Set the parser's character encoding; it should generally match the page's encoding
        parser.setEncoding("gb2312");
        // Create a TextExtractingVisitor object
        TextExtractingVisitor visitor = new TextExtractingVisitor();
        // Strip all tags from the page and extract the plain text content
        parser.visitAllNodesWith(visitor);
        System.out.println("Plain text content of the page: " + visitor.getExtractedText());
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
4. You can also extend NodeVisitor with your own subclass and override the methods it defines:
/** Custom NodeVisitor subclass that overrides the relevant methods of the abstract class NodeVisitor */
public class MyNodeVisitor extends NodeVisitor {
    /** Override beginParsing(); it is called when parsing starts */
    public void beginParsing() {
        System.out.println("Starting to parse the HTML content...");
    }
    /** Override finishedParsing(); it is called when parsing finishes */
    public void finishedParsing() {
        System.out.println("The entire HTML content has been parsed!");
    }
    /** Override visitTag(); it is called when a start tag is encountered */
    public void visitTag(Tag tag) {
        System.out.println("Start tag: " + tag.getText());
    }
    /** Override visitEndTag(); it is called when an end tag is encountered */
    public void visitEndTag(Tag tag) {
        System.out.println("End tag: " + tag.getText());
    }
    /** Override visitStringNode(); it is called when a text node is encountered */
    public void visitStringNode(Text string) {
        System.out.println("Current text node: " + string);
    }
    /** Override visitRemarkNode(); it is called when a comment node is encountered */
    public void visitRemarkNode(Remark remark) {
        System.out.println("Current comment: " + remark);
    }
}
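To actually run the custom visitor, it is handed to a Parser just like the built-in visitors above; a minimal driver might look like this (the class name MyNodeVisitorDemo and the example URL are assumptions):

import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Parser;

public class MyNodeVisitorDemo {
    public static void main(String[] args) {
        try {
            // Create a Parser object from the URLConnection of the page (example URL)
            String url = "http://news.qq.com/";
            Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
            // Match the page's character encoding
            parser.setEncoding("gb2312");
            // Walk every node with the custom visitor; its overridden methods are called back
            parser.visitAllNodesWith(new MyNodeVisitor());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}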
5. Filter examples
Basic usage:
TagNameFilter usage:
// Create a Parser object from the URLConnection of the page
Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
// Set the parser's character encoding; it should generally match the page's encoding
parser.setEncoding("gb2312");
// Create a TagNameFilter instance
NodeFilter filter = new TagNameFilter("div");
// Filter out all div tag nodes
NodeList nodes = parser.extractAllNodesThatMatch(filter);
if (nodes != null) {
    for (int i = 0; i < nodes.size(); i++) {
        Node textNode = nodes.elementAt(i);
        System.out.println("Current div: " + textNode.getText());
    }
}
AndFilter usage:
// Create a Parser object from the URLConnection of the page
Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
// Set the parser's character encoding; it should generally match the page's encoding
parser.setEncoding("gb2312");
// Create a HasAttributeFilter instance
NodeFilter filter1 = new HasAttributeFilter("id");
// Create a TagNameFilter instance
NodeFilter innerFilter = new TagNameFilter("div");
// Create a HasChildFilter instance
NodeFilter filter2 = new HasChildFilter(innerFilter);
// Create an AndFilter instance
NodeFilter filter = new AndFilter(filter1, filter2);
// Filter all nodes that have an id attribute and contain a div child node
NodeList nodes = parser.extractAllNodesThatMatch(filter);
if (nodes != null) {
    for (int i = 0; i < nodes.size(); i++) {
        Node textNode = nodes.elementAt(i);
        System.out.println("Current div: " + textNode.getText());
    }
}
StringFilter usage:
// Create a Parser object from the URLConnection of the page
Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
// Set the parser's character encoding; it should generally match the page's encoding
parser.setEncoding("gb2312");
// Create a StringFilter instance
NodeFilter filter = new StringFilter("Chen Shui-bian");
// Filter all text nodes that contain the string "Chen Shui-bian"
NodeList nodes = parser.extractAllNodesThatMatch(filter);
if (nodes != null) {
    for (int i = 0; i < nodes.size(); i++) {
        Node textNode = nodes.elementAt(i);
        System.out.println("Text node containing the string \"Chen Shui-bian\": " + textNode.getText());
    }
}
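For completeness, here is a sketch of how one of these filter snippets might be packaged as a runnable method, with the imports it needs (the class FilterDemo, the method testStringFilter, the keyword, and the sample URL are all illustrative):

import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.StringFilter;
import org.htmlparser.util.NodeList;

public class FilterDemo {
    /** Hypothetical wrapper around the StringFilter snippet above */
    public static void testStringFilter(String url, String keyword) {
        try {
            // Create a Parser object from the URLConnection of the page
            Parser parser = new Parser((HttpURLConnection) new URL(url).openConnection());
            // Match the page's character encoding
            parser.setEncoding("gb2312");
            // Collect all text nodes containing the keyword
            NodeList nodes = parser.extractAllNodesThatMatch(new StringFilter(keyword));
            if (nodes != null) {
                for (int i = 0; i < nodes.size(); i++) {
                    Node textNode = nodes.elementAt(i);
                    System.out.println("Text node containing \"" + keyword + "\": " + textNode.getText());
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // Example call with an assumed URL and keyword
        testStringFilter("http://news.qq.com/", "news");
    }
}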