First, let's introduce the core class of HTMLParser, org.htmlparser.Parser. This class does the actual work of parsing an HTML page. Its main constructors are as follows:
public Parser();
public Parser(String resource) throws ParserException;
public Parser(String resource, ParserFeedback feedback) throws ParserException;
public Parser(URLConnection connection) throws ParserException;
public Parser(URLConnection connection, ParserFeedback fb) throws ParserException;
public Parser(Lexer lexer);
public Parser(Lexer lexer, ParserFeedback fb);
public static Parser createParser(String html, String charset);
Common ways to create a Parser are as follows:
Method One: Fetch a web page from the network by URL
// Use the public Parser() constructor
Parser parser = new Parser();
parser.setURL("http://www.yahoo.com.cn");

// Use the public Parser(URLConnection connection) throws ParserException constructor
Parser parser = new Parser((HttpURLConnection) (new URL("http://www.baidu.com")).openConnection());

// Or obtain the connection from the ConnectionManager and set the encoding explicitly
org.htmlparser.http.ConnectionManager manager = org.htmlparser.lexer.Page.getConnectionManager();
Parser parser = new Parser(manager.openConnection("http://www.baidu.com"));
parser.setEncoding("GB2312");
Method Two: Parse a local web page file (by reading the file and converting its contents into a string)
// Use the static factory method
Parser parser = Parser.createParser(html, charset);
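Method Two leaves open how the file actually becomes a string. Below is a minimal sketch of that step using only the Java standard library. The class name LocalPageLoader, the helper readPage, and the temporary file are made up for illustration; the hand-off to Parser.createParser is left as a comment because it needs the HTMLParser jar on the classpath.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of HTMLParser.
final class LocalPageLoader {

    // Reads the whole file and decodes its bytes with the given charset.
    static String readPage(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary page so the sketch runs on its own.
        Path page = Files.createTempFile("page", ".html");
        Files.write(page, "<html><body><p>hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8));

        String html = readPage(page, StandardCharsets.UTF_8);
        System.out.println(html);

        // With HTMLParser available, the string is then passed to the factory:
        // Parser parser = Parser.createParser(html, "UTF-8");

        Files.delete(page);
    }
}
```

The charset used to decode the file should match the one passed to createParser; for a page saved as GB2312, for example, both would be GB2312.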
The Node interface includes several groups of methods:
Methods for traversing the tree structure; these are the easiest to understand:
Node getParent(): gets the parent node
NodeList getChildren(): gets the list of child nodes
Node getFirstChild(): gets the first child node
Node getLastChild(): gets the last child node
Node getPreviousSibling(): gets the previous sibling node (the English word "sibling" covers both brothers and sisters; these notes originally rendered it simply as "brother" for brevity)
Node getNextSibling(): gets the next sibling node
Methods for getting a node's content:
String getText(): gets the text
String toPlainTextString(): gets the plain-text content
String toHtml(): gets the HTML (the original HTML)
String toHtml(boolean verbatim): gets the HTML (the original HTML); when verbatim is true the output stays as close as possible to the original page text
String toString(): gets the string representation (the original HTML)
Page getPage(): gets the Page object this node belongs to
int getStartPosition(): gets the node's starting position in the HTML page
int getEndPosition(): gets the node's ending position in the HTML page
Methods for filtering with a filter:
void collectInto(NodeList list, NodeFilter filter): applies the filter to this node and places the nodes that satisfy it into list.
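To make the idea behind collectInto concrete, here is a simplified model written with only the standard library. SketchNode is a hypothetical stand-in, not an org.htmlparser class, and java.util.function.Predicate plays the role of NodeFilter. The recursion into children reflects my understanding of how composite tags behave in the library; treat that as an assumption rather than a description of its source code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a parsed node; NOT an org.htmlparser class.
final class SketchNode {
    final String name;
    final List<SketchNode> children;

    SketchNode(String name, SketchNode... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }

    // Tests this node against the filter, then asks every child to do the
    // same, so matching nodes are added in document order.
    void collectInto(List<SketchNode> list, Predicate<SketchNode> filter) {
        if (filter.test(this)) {
            list.add(this);
        }
        for (SketchNode child : children) {
            child.collectInto(list, filter);
        }
    }

    public static void main(String[] args) {
        SketchNode root = new SketchNode("html",
                new SketchNode("body",
                        new SketchNode("a"),
                        new SketchNode("p", new SketchNode("a"))));

        List<SketchNode> links = new ArrayList<>();
        root.collectInto(links, node -> node.name.equals("a"));
        System.out.println(links.size()); // prints 2
    }
}
```

The caller supplies the list, so several nodes can collect into the same list with different filters.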
Methods for traversal with a visitor:
void accept(NodeVisitor visitor): applies the visitor to this node
Methods for modifying content; these are used less often:
void setPage(Page page): sets the Page object this node belongs to
void setText(String text): sets the text
void setChildren(NodeList children): sets the list of child nodes
Other methods:
void doSemanticAction(): performs the action associated with this node (only a few tags have one)
Object clone(): the clone method declared by the interface.
HTMLParser Learning Notes (I) -- Creating Parser Objects