One: Introduction
(1) HTML Parser is a Java library used to parse HTML in either a linear or a nested fashion. It is mainly used for web page transformation or extraction, and it features filters, visitors, custom tags, and easy-to-use JavaBeans. It is a fast, robust, and well-tested package.
(2) Personal understanding: after HTMLParser traverses the content of a web page, it stores the result in a tree (forest) structure. Each node represents the tags and attribute values in the HTML, much like the result produced by an XML parser, and the whole is similar to the structure of the HTML DOM. HTMLParser offers two ways to access the result: Filters and Visitors. Filters are generally used more often, to extract specific information from a page.
(3) Official API documentation (look it up via Google)
Two: Description of the main functions
(1) The core module of HTMLParser is the org.htmlparser.Parser class, which does the actual work of analyzing an HTML page. This class has the following constructors:
public Parser();
public Parser(Lexer lexer, ParserFeedback fb);
public Parser(URLConnection connection, ParserFeedback fb) throws ParserException;
public Parser(String resource, ParserFeedback feedback) throws ParserException;
public Parser(String resource) throws ParserException;
public Parser(Lexer lexer);
public Parser(URLConnection connection) throws ParserException;
and one static factory method: public static Parser createParser(String html, String charset);
(2) HTMLParser stores the parsed information as a tree structure. Node is the basic data type used to hold that information.
See the definition of Node:
public interface Node extends Cloneable;
Node includes several groups of methods:
Functions for traversing the tree structure, which are the easiest to understand:
Node getParent(): gets the parent node
NodeList getChildren(): gets the list of child nodes
Node getFirstChild(): gets the first child node
Node getLastChild(): gets the last child node
Node getPreviousSibling(): gets the previous sibling node (strictly, "sibling" covers both brothers and sisters)
Node getNextSibling(): gets the next sibling node
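The navigation methods above can be illustrated without the library itself. What follows is a minimal, self-contained sketch: MiniNode and its add() helper are hypothetical stand-ins that only borrow the method names listed above. They are not HTMLParser's own Node or NodeList, and the real classes may behave differently in edge cases.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parse-tree node; this is NOT org.htmlparser.Node. */
public class MiniNode {
    private final String name;
    private MiniNode parent;
    private final List<MiniNode> children = new ArrayList<>();

    public MiniNode(String name) {
        this.name = name;
    }

    /** Appends a child, records this node as its parent, and returns this node for chaining. */
    public MiniNode add(MiniNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    public String getName() { return name; }
    public MiniNode getParent() { return parent; }
    public List<MiniNode> getChildren() { return children; }

    public MiniNode getFirstChild() {
        return children.isEmpty() ? null : children.get(0);
    }

    public MiniNode getLastChild() {
        return children.isEmpty() ? null : children.get(children.size() - 1);
    }

    public MiniNode getPreviousSibling() { return sibling(-1); }
    public MiniNode getNextSibling() { return sibling(+1); }

    /** Looks up the neighbour at the given offset in the parent's child list, or null. */
    private MiniNode sibling(int offset) {
        if (parent == null) return null;
        int i = parent.children.indexOf(this) + offset;
        return (i >= 0 && i < parent.children.size()) ? parent.children.get(i) : null;
    }

    public static void main(String[] args) {
        MiniNode head = new MiniNode("head");
        MiniNode body = new MiniNode("body");
        MiniNode html = new MiniNode("html").add(head).add(body);

        System.out.println(body.getParent().getName());                // html
        System.out.println(head.getNextSibling().getName());           // body
        System.out.println(html.getFirstChild().getPreviousSibling()); // null
    }
}
```

In this sketch a node with no parent, or at either end of its parent's child list, simply returns null for the missing neighbour.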
Functions for getting the content of a Node:
String getText(): gets the text
String toPlainTextString(): gets the plain-text information
String toHtml(): gets the HTML information (the original HTML)
String toHtml(boolean verbatim): gets the HTML information (the original HTML)
String toString(): gets the string information (the original HTML)
Page getPage(): gets the Page object that this Node corresponds to
int getStartPosition(): gets the start position of this Node in the HTML page
int getEndPosition(): gets the end position of this Node in the HTML page
(3) Other functions:
void collectInto(NodeList list, NodeFilter filter): tests this node against the conditions of the filter and puts the matching nodes into list.
Functions for traversal with a Visitor:
void accept(NodeVisitor visitor): applies the visitor to this Node
Functions for modifying content, which are used less often:
void setPage(Page page): sets the Page object that this Node corresponds to
void setText(String text): sets the text
void setChildren(NodeList children): sets the list of child nodes
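The accept(NodeVisitor visitor) method listed above follows the visitor pattern: the node hands itself to a callback object instead of the caller walking the tree by hand. The sketch below models that idea in plain Java. Item, Visitor, outline, and indent are hypothetical names, and the pre-order recursion into children is an assumption of this sketch, not a statement about how the library's own accept is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Self-contained sketch of visitor-style traversal; every name here is hypothetical. */
public class VisitorSketch {

    /** Callback invoked once per node, loosely playing the role of a NodeVisitor. */
    interface Visitor {
        void visit(Item item, int depth);
    }

    /** Hypothetical tree node holding a label and its child nodes. */
    static class Item {
        final String label;
        final List<Item> children = new ArrayList<>();

        Item(String label, Item... kids) {
            this.label = label;
            for (Item k : kids) {
                children.add(k);
            }
        }

        /** Applies the visitor to this node, then to every descendant (pre-order). */
        void accept(Visitor visitor) {
            accept(visitor, 0);
        }

        private void accept(Visitor visitor, int depth) {
            visitor.visit(this, depth);
            for (Item child : children) {
                child.accept(visitor, depth + 1);
            }
        }
    }

    /** Builds an indentation prefix of two spaces per depth level. */
    static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }

    /** Returns the labels in visiting order, indented by depth. */
    static List<String> outline(Item root) {
        List<String> lines = new ArrayList<>();
        root.accept((item, depth) -> lines.add(indent(depth) + item.label));
        return lines;
    }

    public static void main(String[] args) {
        Item page = new Item("html",
                new Item("head", new Item("title")),
                new Item("body", new Item("p")));
        for (String line : outline(page)) {
            System.out.println(line);
        }
    }
}
```

The point of the design is that the traversal order lives in one place (accept), while each visitor only decides what to do with a node once it is reached.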
(4) Functions for filtering with Filter
As the name implies, a Filter filters the results to get the content you need. HTMLParser defines a number of different Filters in the org.htmlparser.filters package, and they can be divided into several categories.
Predicate filters:
TagNameFilter: filters by the specified HTML tag name
HasAttributeFilter: filters by the specified attribute and attribute value
HasChildFilter
HasParentFilter
HasSiblingFilter
IsEqualFilter
Logical-operation filters:
AndFilter: matches nodes that satisfy two or more filter conditions at the same time
NotFilter: logical NOT
OrFilter: logical OR
XorFilter
Other filters:
NodeClassFilter
StringFilter: filters on text content, for example sensitive information
LinkStringFilter: filters on link content, for example sensitive links
LinkRegexFilter
RegexFilter
CssSelectorNodeFilter
All Filter classes implement the org.htmlparser.NodeFilter interface. This interface has only one main function:
boolean accept(Node node);
Each subclass implements this function to decide whether the input Node meets the filter's condition; it returns true if the node matches and false otherwise.
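Because every filter reduces to a single boolean accept(...) test, filters compose naturally: a logical filter just calls accept on the filters it wraps. The sketch below shows that composition in plain Java. Elem, Filter, tagName, hasAttribute, and, and not are hypothetical names chosen to echo TagNameFilter, HasAttributeFilter, AndFilter, and NotFilter. None of the library's classes are used, and the one-attribute Elem is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Self-contained sketch of composable filters; every name here is hypothetical. */
public class FilterSketch {

    /** Simplified element: a tag name, at most one attribute name, and child elements. */
    static class Elem {
        final String tag;
        final String attrName; // null when the element has no attribute
        final List<Elem> children = new ArrayList<>();

        Elem(String tag, String attrName, Elem... kids) {
            this.tag = tag;
            this.attrName = attrName;
            for (Elem k : kids) {
                children.add(k);
            }
        }
    }

    /** One-method contract, analogous in spirit to boolean accept(Node node). */
    interface Filter {
        boolean accept(Elem e);
    }

    /** Matches elements whose tag name equals the given name, ignoring case. */
    static Filter tagName(String name) {
        return e -> e.tag.equalsIgnoreCase(name);
    }

    /** Matches elements that carry the given attribute name. */
    static Filter hasAttribute(String attr) {
        return e -> attr.equalsIgnoreCase(e.attrName);
    }

    /** Matches only when every wrapped filter matches. */
    static Filter and(Filter... filters) {
        return e -> {
            for (Filter f : filters) {
                if (!f.accept(e)) return false;
            }
            return true;
        };
    }

    /** Inverts the wrapped filter. */
    static Filter not(Filter f) {
        return e -> !f.accept(e);
    }

    /** Depth-first walk that adds every element the filter accepts to out. */
    static void collectInto(Elem e, List<Elem> out, Filter filter) {
        if (filter.accept(e)) out.add(e);
        for (Elem child : e.children) {
            collectInto(child, out, filter);
        }
    }

    public static void main(String[] args) {
        Elem root = new Elem("div", null,
                new Elem("a", "href"),
                new Elem("a", null),
                new Elem("p", "class"));

        List<Elem> links = new ArrayList<>();
        collectInto(root, links, and(tagName("a"), hasAttribute("href")));
        System.out.println(links.size()); // 1
    }
}
```

With the real library, the same intent would be expressed by combining the filter classes listed above rather than these stand-ins.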
Three: Diagram and description of HTML structure parsing
(1) HTML code
(2) The DOM structure of the HTML (that is, the parse tree produced by the parser)
(3) Description
- As we can see from the structure diagram, the entire document is a document node.
- Each HTML tag is an element node.
- The text inside a tag is a text node.
- The attributes of a tag are attribute nodes.
- Everything is a node ...
In short, the concept of the node tree is clear at a glance: the top is the "root". Nodes are related as parent and child, as ancestor and descendant, and as siblings. These relationships are easy to read off the diagram: nodes joined by a direct connection are parent and child, and nodes that share the same parent are siblings ... For more on the DOM, see
The Way of Big Data Processing (getting data with HTMLParser, part one)