HTMLParser Learning Summary

Source: Internet
Author: User

HTMLParser is a Java library for extracting information from web pages. The following describes some basic usage:

 

1. Create a Parser object in one of two ways
The first way is to pass in the HTML as a string:
Parser parser = new Parser(html);
The second way is to pass in a URLConnection:
// Create a Parser object from a URLConnection
Parser parser = new Parser((HttpURLConnection) (new URL(url)).openConnection());
You can then access the parsed content through parser.
2. There are two ways to process the parsed content: visitors and filters. A visitor traverses every node,
while a filter selects only the nodes that match a condition.
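
The difference between the two styles can be sketched without the library. The following is a minimal illustration using only the Java standard library; ToyNode, ToyVisitor, visitAll, and extractMatching are hypothetical names invented for this sketch and are not part of HTMLParser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class VisitorVsFilterSketch {
    // Toy stand-in for a parsed node (hypothetical, not an HTMLParser class).
    static class ToyNode {
        final String name;
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String name, ToyNode... kids) {
            this.name = name;
            for (ToyNode kid : kids) {
                children.add(kid);
            }
        }
    }

    // Visitor style: a callback is invoked for every node, in document order.
    interface ToyVisitor {
        void visit(ToyNode node);
    }

    static void visitAll(ToyNode root, ToyVisitor visitor) {
        visitor.visit(root);
        for (ToyNode child : root.children) {
            visitAll(child, visitor);
        }
    }

    // Filter style: the traversal is hidden and only matching nodes come back.
    static List<ToyNode> extractMatching(ToyNode root, Predicate<ToyNode> filter) {
        List<ToyNode> matches = new ArrayList<>();
        visitAll(root, node -> {
            if (filter.test(node)) {
                matches.add(node);
            }
        });
        return matches;
    }

    public static void main(String[] args) {
        ToyNode page = new ToyNode("html",
                new ToyNode("body",
                        new ToyNode("div", new ToyNode("a")),
                        new ToyNode("div")));

        List<String> seen = new ArrayList<>();
        visitAll(page, node -> seen.add(node.name));
        System.out.println("Visitor saw: " + seen);   // [html, body, div, a, div]

        List<ToyNode> divs = extractMatching(page, node -> node.name.equals("div"));
        System.out.println("Filter kept " + divs.size() + " div nodes");   // 2
    }
}
```

With a visitor, the caller decides what to do at each node; with a filter, the caller only states the condition and receives the matching nodes as a list.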

3. Visitor example:
// Requires: java.net.HttpURLConnection, java.net.URL,
// org.htmlparser.Parser, org.htmlparser.visitors.LinkFindingVisitor
try {
    // Create a Parser object from a URLConnection
    Parser parser = new Parser((HttpURLConnection) (new URL(url)).openConnection());
    // Set the parser's character encoding, which should generally match the encoding of the web page
    parser.setEncoding("gb2312");
    // Create a LinkFindingVisitor object
    LinkFindingVisitor lvisitor = new LinkFindingVisitor("http://news.qq.com/");
    // Count the links that contain http://news.qq.com/
    parser.visitAllNodesWith(lvisitor);
    System.out.println("The number of links on the web page that contain http://news.qq.com/: " + lvisitor.getCount());
} catch (Exception ex) {
    ex.printStackTrace();
}
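
The encoding passed to setEncoding is hard-coded in these examples. As a rough sketch of how the value could instead be derived from the server's response, the helper below pulls the charset out of a Content-Type header value using only the standard library. The class and method names (CharsetSketch, charsetFrom) are hypothetical, and the sketch makes no assumption about whether HTMLParser performs similar detection on its own.

```java
import java.util.Locale;

class CharsetSketch {
    // Returns the charset named in a Content-Type header value,
    // or the fallback when the header is missing or declares no charset.
    static String charsetFrom(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String trimmed = part.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = trimmed.substring("charset=".length())
                        .replace("\"", "")
                        .trim();
                return name.isEmpty() ? fallback : name;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=gb2312", "ISO-8859-1")); // gb2312
        System.out.println(charsetFrom("text/html", "ISO-8859-1"));                 // ISO-8859-1
    }
}
```

The header value itself is available from the standard URLConnection.getContentType() method, so the result could be passed to parser.setEncoding in place of the literal "gb2312".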

/** Example of TextExtractingVisitor class usage */
// Requires: java.net.HttpURLConnection, java.net.URL,
// org.htmlparser.Parser, org.htmlparser.visitors.TextExtractingVisitor
public static void testTextExtractingVisitor(String url) {
    try {
        // Create a Parser object from a URLConnection
        Parser parser = new Parser((HttpURLConnection) (new URL(url)).openConnection());
        // Set the parser's character encoding, which should generally match the encoding of the web page
        parser.setEncoding("gb2312");
        // Create a TextExtractingVisitor object
        TextExtractingVisitor visitor = new TextExtractingVisitor();
        // Strip all tags from the web page and extract the plain text content
        parser.visitAllNodesWith(visitor);
        System.out.println("The plain text content of the web page is: " + visitor.getExtractedText());
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

 

4. You can also write your own visitor by extending NodeVisitor and overriding its methods:
// Requires: org.htmlparser.Remark, org.htmlparser.Tag, org.htmlparser.Text,
// org.htmlparser.visitors.NodeVisitor
/** A custom NodeVisitor subclass that overrides the relevant methods of the abstract class NodeVisitor */
public class MyNodeVisitor extends NodeVisitor {

    /** Overrides beginParsing, which is called when parsing begins */
    public void beginParsing() {
        System.out.println("Starting to parse the HTML content ...");
    }

    /** Overrides finishedParsing, which is called when parsing ends */
    public void finishedParsing() {
        System.out.println("The entire HTML content has been parsed!");
    }

    /** Overrides visitTag, which is called when a start tag is encountered */
    public void visitTag(Tag tag) {
        System.out.println("Start of current tag: " + tag.getText());
    }

    /** Overrides visitEndTag, which is called when an end tag is encountered */
    public void visitEndTag(Tag tag) {
        System.out.println("End of current tag: " + tag.getText());
    }

    /** Overrides visitStringNode, which is called when a text node is encountered */
    public void visitStringNode(Text string) {
        System.out.println("Current text node: " + string);
    }

    /** Overrides visitRemarkNode, which is called when a comment is encountered */
    public void visitRemarkNode(Remark remark) {
        System.out.println("Current comment: " + remark);
    }
}

5. Filters
Basic usage:
TagNameFilter usage:
// Requires: java.net.HttpURLConnection, java.net.URL, org.htmlparser.Node, org.htmlparser.NodeFilter,
// org.htmlparser.Parser, org.htmlparser.filters.TagNameFilter, org.htmlparser.util.NodeList
// Create a Parser object from a URLConnection
Parser parser = new Parser((HttpURLConnection) (new URL(url)).openConnection());
// Set the parser's character encoding, which should generally match the encoding of the web page
parser.setEncoding("gb2312");
// Create a TagNameFilter instance
NodeFilter filter = new TagNameFilter("div");
// Select all div tag nodes
NodeList nodes = parser.extractAllNodesThatMatch(filter);
if (nodes != null) {
    for (int i = 0; i < nodes.size(); i++) {
        Node divNode = nodes.elementAt(i);
        System.out.println("Current div: " + divNode.getText());
    }
}

AndFilter usage:
// Also requires: org.htmlparser.filters.AndFilter, org.htmlparser.filters.HasAttributeFilter,
// org.htmlparser.filters.HasChildFilter
// Create a Parser object from a URLConnection
Parser parser = new Parser((HttpURLConnection) (new URL(url)).openConnection());
// Set the parser's character encoding, which should generally match the encoding of the web page
parser.setEncoding("gb2312");
// Create a HasAttributeFilter instance
NodeFilter filter1 = new HasAttributeFilter("id");
// Create a TagNameFilter instance
NodeFilter innerFilter = new TagNameFilter("div");
// Create a HasChildFilter instance
NodeFilter filter2 = new HasChildFilter(innerFilter);
// Create an AndFilter instance that requires both conditions
NodeFilter filter = new AndFilter(filter1, filter2);
// Select all nodes that have an id attribute and a div child node
NodeList nodes = parser.extractAllNodesThatMatch(filter);
if (nodes != null) {
    for (int i = 0; i < nodes.size(); i++) {
        Node node = nodes.elementAt(i);
        System.out.println("Current node: " + node.getText());
    }
}
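
The way the filters combine can be illustrated with plain java.util.function.Predicate objects. This is only an analogy under stated assumptions: Elem, hasAttribute, tagName, and hasChild are hypothetical helpers written for this sketch, not HTMLParser classes, and hasChild here checks direct children only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class AndFilterSketch {
    // Toy element with a tag name, attributes, and children (hypothetical).
    static class Elem {
        final String name;
        final Map<String, String> attrs = new HashMap<>();
        final List<Elem> children = new ArrayList<>();

        Elem(String name) { this.name = name; }
        Elem attr(String key, String value) { attrs.put(key, value); return this; }
        Elem child(Elem c) { children.add(c); return this; }
    }

    // Accepts elements that carry the given attribute.
    static Predicate<Elem> hasAttribute(String attr) {
        return e -> e.attrs.containsKey(attr);
    }

    // Accepts elements with the given tag name, ignoring case.
    static Predicate<Elem> tagName(String name) {
        return e -> e.name.equalsIgnoreCase(name);
    }

    // Accepts elements with at least one direct child matching the inner condition.
    static Predicate<Elem> hasChild(Predicate<Elem> inner) {
        return e -> e.children.stream().anyMatch(inner);
    }

    public static void main(String[] args) {
        Elem a = new Elem("div").attr("id", "main").child(new Elem("div"));
        Elem b = new Elem("div").attr("id", "footer").child(new Elem("p"));
        Elem c = new Elem("div").child(new Elem("div"));

        // Both conditions must hold, mirroring the two-filter combination above.
        Predicate<Elem> filter = hasAttribute("id").and(hasChild(tagName("div")));
        for (Elem e : Arrays.asList(a, b, c)) {
            System.out.println(e.attrs.getOrDefault("id", "(no id)") + " -> " + filter.test(e));
        }
        // main -> true
        // footer -> false
        // (no id) -> false
    }
}
```

Only the first element passes, because it is the only one that has an id attribute and also a div child.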
StringFilter usage:
// Also requires: org.htmlparser.filters.StringFilter
// Create a Parser object from a URLConnection
Parser parser = new Parser((HttpURLConnection) (new URL(url)).openConnection());
// Set the parser's character encoding, which should generally match the encoding of the web page
parser.setEncoding("gb2312");
// Create a StringFilter instance
NodeFilter filter = new StringFilter("Chen Shui-bian");
// Select all text nodes that contain the string "Chen Shui-bian"
NodeList nodes = parser.extractAllNodesThatMatch(filter);
if (nodes != null) {
    for (int i = 0; i < nodes.size(); i++) {
        Node textNode = nodes.elementAt(i);
        System.out.println("Text node containing the \"Chen Shui-bian\" string: " + textNode.getText());
    }
}
