Parsing webpage source code with the HTMLParser Lexer

Source: Internet
Author: User
Tags: lexer

Sometimes, when parsing the source code of a webpage, we need only the text content rather than the source of the entire page. In that case we can use the open-source HTMLParser library. The example below is deliberately simple; its main purpose is to show what some of the library's classes do:
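[Java]
// Package names follow the HTMLParser 1.6/2.x layout.
import org.htmlparser.Node;
import org.htmlparser.http.ConnectionManager;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.nodes.AbstractNode;
import org.htmlparser.nodes.RemarkNode;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.nodes.TextNode;

public static String html2Text() {
    // The shared ConnectionManager opens the URLConnection that the Lexer reads from.
    ConnectionManager manager = Page.getConnectionManager();
    StringBuilder textSB = new StringBuilder();
    StringBuilder tagSB = new StringBuilder();
    StringBuilder remarkSB = new StringBuilder();
    // StringBuilder abstractSB = new StringBuilder();
    try {
        Lexer lexer = new Lexer(manager.openConnection("http://www.baidu.com"));
        Node node;
        // nextNode() returns null once the whole page has been consumed.
        while ((node = lexer.nextNode()) != null) {
            if (node instanceof TextNode) {
                textSB.append(node.toHtml());
            } else if (node instanceof TagNode) {
                tagSB.append(node.toHtml());
            } else if (node instanceof RemarkNode) {
                remarkSB.append(node.toHtml());
            } else if (node instanceof AbstractNode) {
                // abstractSB.append(node.toHtml());
            }
        }
        // Text first, then tags, then comments, separated by dashed lines.
        return textSB.toString() + "\r\n" + "--------" + tagSB.toString()
                + "\r\n" + "--------" + remarkSB.toString() + "\r\n"
                + "-------"
                // + abstractSB.toString()
                ;
    } catch (Exception e) {
        // Keep the original exception as the cause instead of discarding it.
        throw new RuntimeException(e);
    }
}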
The main classes used in this example are described below:
ConnectionManager is the class that opens connections to web pages. Lexer accepts a URLConnection in its constructor and reads the page content through it; it acts as a wrapper around the page, and its methods give access to that content. Each call to Lexer's nextNode returns the next node of the page, starting from the first one. Node is an interface implemented by AbstractNode, and TextNode, TagNode, and RemarkNode are subclasses of AbstractNode. A TextNode represents a text node in the page, a TagNode represents a tag node, and a RemarkNode represents a comment node. AbstractNode provides methods for obtaining a node's start and end positions in the page, its parent node, and its list of child nodes, so TextNode, TagNode, and RemarkNode inherit these methods and can use them to obtain the same information.
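To make the three node kinds more concrete, here is a toy tokenizer that uses only the Java standard library. It mimics the way nextNode hands back text, tag, and comment nodes one at a time, each with a start and end position. It is a simplified sketch and not HTMLParser's implementation: the names MiniLexer, MiniNode, Kind, and collect are hypothetical, and real HTML (attributes containing ">", scripts, and so on) needs the real Lexer.

```java
/** Toy stand-in for a lexer. Hypothetical names; NOT HTMLParser's implementation. */
class MiniLexer {
    /** The three node kinds described in the text. */
    enum Kind { TEXT, TAG, REMARK }

    /** One node plus the source offsets it spans (end is exclusive). */
    static final class MiniNode {
        final Kind kind;
        final String html;
        final int start;
        final int end;

        MiniNode(Kind kind, String html, int start, int end) {
            this.kind = kind;
            this.html = html;
            this.start = start;
            this.end = end;
        }
    }

    private final String source;
    private int pos = 0;

    MiniLexer(String source) {
        this.source = source;
    }

    /** Returns the next node, or null when the input is exhausted. */
    MiniNode nextNode() {
        if (pos >= source.length()) {
            return null;
        }
        int start = pos;
        Kind kind;
        if (source.startsWith("<!--", pos)) {
            // Comment: runs up to and including the closing "-->".
            int close = source.indexOf("-->", pos + 4);
            pos = (close < 0) ? source.length() : close + 3;
            kind = Kind.REMARK;
        } else if (source.charAt(pos) == '<') {
            // Tag: runs up to and including the next '>'.
            int close = source.indexOf('>', pos + 1);
            pos = (close < 0) ? source.length() : close + 1;
            kind = Kind.TAG;
        } else {
            // Text: everything before the next '<'.
            int open = source.indexOf('<', pos);
            pos = (open < 0) ? source.length() : open;
            kind = Kind.TEXT;
        }
        return new MiniNode(kind, source.substring(start, pos), start, pos);
    }

    /** Concatenates the source of every node of the requested kind. */
    static String collect(String html, Kind wanted) {
        StringBuilder sb = new StringBuilder();
        MiniLexer lexer = new MiniLexer(html);
        MiniNode node;
        while ((node = lexer.nextNode()) != null) {
            if (node.kind == wanted) {
                sb.append(node.html);
            }
        }
        return sb.toString();
    }
}
```

For example, MiniLexer.collect("<p>Hi</p><!-- note -->bye", MiniLexer.Kind.TEXT) returns "Hibye", which is the same text-only filtering that the TextNode branch performs in html2Text.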
Author: uohzoaix
