Sometimes, when parsing a webpage, we need only its text content rather than the source code of the entire page. In this case, we can use the open-source htmlparser library. The following example is simple and is mainly meant to show what some of the classes in this library do:
[Java]
import org.htmlparser.Node;
import org.htmlparser.http.ConnectionManager;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.nodes.AbstractNode;
import org.htmlparser.nodes.RemarkNode;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.nodes.TextNode;

public static String html2Text() {
    // Separate buffers for text nodes, tag nodes, and comment (remark) nodes.
    StringBuilder textSB = new StringBuilder();
    StringBuilder tagSB = new StringBuilder();
    StringBuilder remarkSB = new StringBuilder();
    // StringBuilder abstractSB = new StringBuilder();
    try {
        ConnectionManager manager = Page.getConnectionManager();
        Lexer lexer = new Lexer(manager.openConnection("http://www.baidu.com"));
        Node node;
        // nextNode() returns null once the whole page has been read.
        while ((node = lexer.nextNode()) != null) {
            if (node instanceof TextNode) {
                textSB.append(node.toHtml());
            } else if (node instanceof TagNode) {
                tagSB.append(node.toHtml());
            } else if (node instanceof RemarkNode) {
                remarkSB.append(node.toHtml());
            } else if (node instanceof AbstractNode) {
                // abstractSB.append(node.toHtml());
            }
        }
        return textSB.toString() + "\r\n" + "--------" + tagSB.toString()
                + "\r\n" + "--------" + remarkSB.toString() + "\r\n"
                + "-------"
                // + abstractSB.toString()
                ;
    } catch (Exception e) {
        // Keep the original exception as the cause so the failure is not hidden.
        throw new RuntimeException(e);
    }
}
The main classes used in this example are described below:
ConnectionManager is the class that opens the connection to a web page. Lexer accepts a URLConnection through its constructor and acts as a wrapper around the content of the page, which is then read through Lexer's methods. Each call to Lexer's nextNode returns the next node of the page, starting from the beginning, and returns null when no nodes remain. Node is implemented by AbstractNode, and TextNode, TagNode, and RemarkNode are subclasses of AbstractNode. A TextNode represents a text node in the page, a TagNode represents a tag node, and a RemarkNode represents a comment node. AbstractNode provides methods for obtaining a node's start and end positions in the page, its parent node, and its list of child nodes, so TextNode, TagNode, and RemarkNode inherit these methods and can use them to obtain the same information.
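To make the split into text, tag, and comment nodes more concrete, here is a minimal, self-contained sketch of the idea in plain Java. It does not use htmlparser and is not how the library is implemented: ToyLexer, ToyNode, NodeKind, and textOnly are hypothetical names invented for this illustration, and the scanner deliberately ignores real-world cases such as a '>' inside an attribute value or the contents of script elements.

```java
/** Toy illustration only: hypothetical names, not the htmlparser API. */
public class ToyLexer {
    /** The three node kinds discussed above. */
    public enum NodeKind { TEXT, TAG, REMARK }

    /** A node keeps its kind, its raw source, and its start/end offsets in the page. */
    public static final class ToyNode {
        public final NodeKind kind;
        public final String html;
        public final int start;
        public final int end;

        ToyNode(NodeKind kind, String html, int start, int end) {
            this.kind = kind;
            this.html = html;
            this.start = start;
            this.end = end;
        }
    }

    private final String page;
    private int pos = 0;

    public ToyLexer(String page) {
        this.page = page;
    }

    /** Returns the next node, or null at the end of the input. */
    public ToyNode nextNode() {
        if (pos >= page.length()) {
            return null;
        }
        int start = pos;
        NodeKind kind;
        int end;
        if (page.startsWith("<!--", pos)) {
            // Comment: runs up to and including the closing "-->".
            kind = NodeKind.REMARK;
            int close = page.indexOf("-->", pos + 4);
            end = (close < 0) ? page.length() : close + 3;
        } else if (page.charAt(pos) == '<') {
            // Tag: runs up to and including the next '>'.
            kind = NodeKind.TAG;
            int close = page.indexOf('>', pos + 1);
            end = (close < 0) ? page.length() : close + 1;
        } else {
            // Text: everything up to the next '<'.
            kind = NodeKind.TEXT;
            int next = page.indexOf('<', pos);
            end = (next < 0) ? page.length() : next;
        }
        pos = end;
        return new ToyNode(kind, page.substring(start, end), start, end);
    }

    /** Collects only the text nodes, which is the "text content" use case. */
    public static String textOnly(String page) {
        ToyLexer lexer = new ToyLexer(page);
        StringBuilder sb = new StringBuilder();
        ToyNode node;
        while ((node = lexer.nextNode()) != null) {
            if (node.kind == NodeKind.TEXT) {
                sb.append(node.html);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints "Hello world": the tags and the comment are skipped.
        System.out.println(textOnly("<p>Hello<!-- note --> world</p>"));
    }
}
```

The loop in textOnly has the same shape as the loop in html2Text: keep asking for the next node until null comes back, and dispatch on the kind of node. With the real library, the dispatch is done with instanceof on the node classes rather than on an enum.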
Author: uohzoaix