Jsoup Code Walkthrough, Part 3: Document Output
According to Jsoup's official description, one of its important features is outputting tidy HTML. Here we look at how Jsoup outputs HTML.
HTML background
Before analyzing the code, let's consider what "tidy HTML" involves:
- Line breaks: a block-level tag customarily occupies a line of its own
- Indentation: it follows the nesting depth of the HTML tags, so the leading whitespace differs from line to line
- Strict tag closing: a tag that can self-close and has no content is written as self-closing
- Escaping of HTML entities
Some background on HTML tags is needed here. HTML tags fall into two categories, block and inline. For the definitions of inline and block tags, see http://www.w3schools.com/html/html_blocks.asp; the Jsoup Tag class is also very good learning material for Java developers.
    // internal static initialisers:
    // prepped from http://www.w3.org/TR/REC-html40/sgml/dtd.html and other sources

    // block tags, which need a line break
    private static final String[] blockTags = {
            "html", "head", "body", "frameset", "script", "noscript", "style", "meta", "link", "title", "frame",
            "noframes", "section", "nav", "aside", "hgroup", "header", "footer", "p", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "pre", "div", "blockquote", "hr", "address", "figure", "figcaption", "form", "fieldset", "ins",
            "del", "s", "dl", "dt", "dd", "li", "table", "caption", "thead", "tfoot", "tbody", "colgroup", "col", "tr", "th",
            "td", "video", "audio", "canvas", "details", "menu", "plaintext"
    };
    // inline tags, no line break
    private static final String[] inlineTags = {
            "object", "base", "font", "tt", "i", "b", "u", "big", "small", "em", "strong", "dfn", "code", "samp", "kbd",
            "var", "cite", "abbr", "time", "acronym", "mark", "ruby", "rt", "rp", "a", "img", "br", "wbr", "map", "q",
            "sub", "sup", "bdo", "iframe", "embed", "span", "input", "select", "textarea", "label", "button", "optgroup",
            "option", "legend", "datalist", "keygen", "output", "progress", "meter", "area", "param", "source", "track",
            "summary", "command", "device"
    };
    // emptyTags are tags that cannot have content; such tags can be self-closing
    private static final String[] emptyTags = {
            "meta", "link", "base", "frame", "img", "br", "wbr", "embed", "hr", "input", "keygen", "col", "command",
            "device"
    };
    private static final String[] formatAsInlineTags = {
            "title", "a", "p", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "address", "th", "td", "script", "style",
            "ins", "del", "s"
    };
    // in these tags, whitespace must be preserved
    private static final String[] preserveWhitespaceTags = {"pre", "plaintext", "title", "textarea"};
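To make the role of these lists concrete, here is a minimal sketch of how such arrays could be turned into per-tag lookups. The class TagKindSketch and its methods are hypothetical and are not part of Jsoup; the lists are shortened samples, and treating every unlisted tag as inline is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not Jsoup's Tag class: answers the two questions the
// formatter cares about, "does this tag get its own line?" and "may it self-close?".
class TagKindSketch {
    // Shortened sample lists; the full lists appear in the article above.
    private static final Set<String> BLOCK = new HashSet<>(Arrays.asList(
            "html", "head", "body", "div", "p", "ul", "ol", "li", "table", "tr", "td", "hr"));
    private static final Set<String> EMPTY = new HashSet<>(Arrays.asList(
            "meta", "link", "img", "br", "hr", "input"));

    private static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }

    // Block tags are the ones that start on a new line when pretty-printing.
    static boolean isBlock(String name) {
        return BLOCK.contains(normalize(name));
    }

    // Assumption of this sketch: anything not listed as block is inline.
    static boolean isInline(String name) {
        return !isBlock(name);
    }

    // Empty tags cannot have content, so they can be written as self-closing.
    static boolean canSelfClose(String name) {
        return EMPTY.contains(normalize(name));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"div", "span", "br", "hr"}) {
            System.out.println(name + ": block=" + isBlock(name)
                    + ", selfClosing=" + canSelfClose(name));
        }
    }
}
```

A set lookup is used here only because it keeps the sketch short; the point is that block/inline and empty/non-empty are independent properties of a tag (hr is both block and empty).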
In addition, the Jsoup Entities class handles the escaping of HTML entities. The data for these escapes is stored in entities-full.properties and entities-base.properties.
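As an illustration of what entity escaping means, here is a minimal sketch that escapes only four characters from a hard-coded table. The class EscapeSketch is hypothetical; Jsoup's Entities class loads its tables from the properties files mentioned above, which this sketch does not attempt to reproduce.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified escaper; not Jsoup's Entities implementation.
class EscapeSketch {
    // A tiny hard-coded table standing in for the entity data files.
    private static final Map<Character, String> BASE = new HashMap<>();
    static {
        BASE.put('&', "&amp;");
        BASE.put('<', "&lt;");
        BASE.put('>', "&gt;");
        BASE.put('"', "&quot;");
    }

    // Replaces every character that has an entry in the table; copies the rest.
    static String escape(String text) {
        StringBuilder accum = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String entity = BASE.get(c);
            if (entity != null) {
                accum.append(entity);
            } else {
                accum.append(c);
            }
        }
        return accum.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b && \"c\" > d"));
    }
}
```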
How Jsoup implements formatting
In Jsoup, calling Document.toString() (inherited from Element) directly outputs the document. In addition, OutputSettings controls the output format, mainly through prettyPrint (whether to reformat), outline (whether to force every tag onto its own line), indentAmount (the indent width), and so on.
The inheritance and call relationships involved are slightly complex. Roughly, they look like this:
Document.toString()
= Document.outerHtml()
= Element.html()
Eventually, Element.html() loops over all child nodes, calls outerHtml() on each of them, and concatenates the results as the output.
    private void html(StringBuilder accum) {
        for (Node node : childNodes)
            node.outerHtml(accum);
    }
outerHtml(), in turn, uses an OuterHtmlVisitor to traverse the node and its children and assemble them into the result.
    protected void outerHtml(StringBuilder accum) {
        new NodeTraversor(new OuterHtmlVisitor(accum, getOutputSettings())).traverse(this);
    }
OuterHtmlVisitor visits every node in the subtree and calls two methods on it, node.outerHtmlHead() and node.outerHtmlTail().
    private static class OuterHtmlVisitor implements NodeVisitor {
        private StringBuilder accum;
        private Document.OutputSettings out;

        public void head(Node node, int depth) {
            node.outerHtmlHead(accum, depth, out);
        }

        public void tail(Node node, int depth) {
            if (!node.nodeName().equals("#text")) // saves a void hit.
                node.outerHtmlTail(accum, depth, out);
        }
    }
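The head/tail pattern can be sketched in isolation as follows. Node, Visitor and traverse here are hypothetical stand-ins for Jsoup's Node, NodeVisitor and NodeTraversor; the sketch uses plain recursion and ignores attributes, text content and formatting, so it only shows the order in which head and tail are called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Jsoup's Node, NodeVisitor and NodeTraversor.
class TraversalSketch {
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    interface Visitor {
        void head(Node node, int depth); // called when a node is first reached
        void tail(Node node, int depth); // called after all its children are done
    }

    // Depth-first walk: head, then every child, then tail.
    static void traverse(Node node, int depth, Visitor visitor) {
        visitor.head(node, depth);
        for (Node child : node.children) {
            traverse(child, depth + 1, visitor);
        }
        visitor.tail(node, depth);
    }

    // Assembles markup the same way the visitor above does: open tags in head,
    // close tags in tail, and nothing in tail for text nodes.
    static String render(Node root) {
        final StringBuilder accum = new StringBuilder();
        traverse(root, 0, new Visitor() {
            public void head(Node node, int depth) {
                if (node.name.equals("#text")) {
                    accum.append("text");
                } else {
                    accum.append('<').append(node.name).append('>');
                }
            }

            public void tail(Node node, int depth) {
                if (!node.name.equals("#text")) {
                    accum.append("</").append(node.name).append('>');
                }
            }
        });
        return accum.toString();
    }

    public static void main(String[] args) {
        Node root = new Node("div")
                .add(new Node("p").add(new Node("#text")))
                .add(new Node("span"));
        System.out.println(render(root)); // prints <div><p>text</p><span></span></div>
    }
}
```

Because every node receives exactly one head call and one tail call, the open and close tags always pair up, whatever the depth of nesting.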
We have finally found the code that does the real work: node.outerHtmlHead() and node.outerHtmlTail(). Each kind of node in Jsoup produces its output differently; here we discuss only the two main ones, Element and TextNode. Element is the main target of formatting, and the code of its two methods is as follows:
    void outerHtmlHead(StringBuilder accum, int depth, Document.OutputSettings out) {
        if (accum.length() > 0 && out.prettyPrint()
                && (tag.formatAsBlock() || (parent() != null && parent().tag().formatAsBlock()) || out.outline()))
            // break the line and adjust indentation
            indent(accum, depth, out);
        accum
                .append("<")
                .append(tagName());
        attributes.html(accum, out);

        if (childNodes.isEmpty() && tag.isSelfClosing())
            accum.append("/>");
        else
            accum.append(">");
    }

    void outerHtmlTail(StringBuilder accum, int depth, Document.OutputSettings out) {
        if (!(childNodes.isEmpty() && tag.isSelfClosing())) {
            if (out.prettyPrint() && (!childNodes.isEmpty() && (
                    tag.formatAsBlock() || (out.outline() && (childNodes.size() > 1
                            || (childNodes.size() == 1 && !(childNodes.get(0) instanceof TextNode))))
            )))
                // break the line and adjust indentation
                indent(accum, depth, out);
            accum.append("</").append(tagName()).append(">");
        }
    }
The indent method contains just one line:
    protected void indent(StringBuilder accum, int depth, Document.OutputSettings out) {
        // out.indentAmount() is the indent width; the default is 1
        accum.append("\n").append(StringUtil.padding(depth * out.indentAmount()));
    }
The code is simple and clear, so there is little to add. It is worth mentioning that StringUtil.padding() keeps the commonly used indentation strings in an array to avoid creating new strings each time.
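The caching idea can be sketched as below. PaddingSketch is a hypothetical class, not Jsoup's StringUtil; in particular the cache size of 11 entries is an assumption chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch of a padding helper that caches short indent strings.
class PaddingSketch {
    // CACHE[i] holds a string of i spaces; the size is an assumption of this sketch.
    private static final String[] CACHE = new String[11];
    static {
        StringBuilder spaces = new StringBuilder();
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = spaces.toString();
            spaces.append(' ');
        }
    }

    // Returns a string of `width` spaces, reusing a cached instance when possible.
    static String padding(int width) {
        if (width < 0) {
            throw new IllegalArgumentException("width must be >= 0");
        }
        if (width < CACHE.length) {
            return CACHE[width];
        }
        char[] out = new char[width];
        Arrays.fill(out, ' ');
        return new String(out);
    }

    // Same shape as the indent method above: a newline followed by the padding.
    static void indent(StringBuilder accum, int depth, int indentAmount) {
        accum.append('\n').append(padding(depth * indentAmount));
    }

    public static void main(String[] args) {
        StringBuilder accum = new StringBuilder("<div>");
        indent(accum, 2, 1);
        accum.append("<p>");
        System.out.println(accum);
    }
}
```

Typical documents nest only a few levels deep, so most indent calls hit the cache and allocate nothing; only unusually deep nesting falls through to building a fresh string.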
All right, this was a fairly light article. The next one covers the parser, which has more technical depth.
In addition, from this part we learned to name a StringBuilder accum rather than sb.