Jsoup code walkthrough, part 3: document output

Source: Internet
Author: User
Tags tagname tidy

According to the official Jsoup documentation, one of its important features is outputting tidy HTML. Here we look at how Jsoup produces that output.

HTML background

Before analyzing the code, let's consider what "tidy HTML" includes:

    • Line breaks: a block-level tag conventionally occupies its own line
    • Indentation: the indent at the start of a line varies with how deeply the HTML tag is nested
    • Strict tag closing: a tag that can be self-closing and has no content is written self-closed
    • Escaping of HTML entities

Some background on HTML tags is needed here. HTML tags fall into two categories, block and inline. For the definitions of inline and block tags, see http://www.w3schools.com/html/html_blocks.asp; the Jsoup Tag class is also very good study material for Java developers.

    // internal static initialisers:
    // prepped from http://www.w3.org/TR/REC-html40/sgml/dtd.html and other sources

    // block tags, which need a line break
    private static final String[] blockTags = {
            "html", "head", "body", "frameset", "script", "noscript", "style", "meta", "link", "title", "frame",
            "noframes", "section", "nav", "aside", "hgroup", "header", "footer", "p", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "pre", "div", "blockquote", "hr", "address", "figure", "figcaption", "form", "fieldset", "ins",
            "del", "s", "dl", "dt", "dd", "li", "table", "caption", "thead", "tfoot", "tbody", "colgroup", "col", "tr", "th",
            "td", "video", "audio", "canvas", "details", "menu", "plaintext"
    };

    // inline tags, no line break
    private static final String[] inlineTags = {
            "object", "base", "font", "tt", "i", "b", "u", "big", "small", "em", "strong", "dfn", "code", "samp", "kbd",
            "var", "cite", "abbr", "time", "acronym", "mark", "ruby", "rt", "rp", "a", "img", "br", "wbr", "map", "q",
            "sub", "sup", "bdo", "iframe", "embed", "span", "input", "select", "textarea", "label", "button", "optgroup",
            "option", "legend", "datalist", "keygen", "output", "progress", "meter", "area", "param", "source", "track",
            "summary", "command", "device"
    };

    // empty tags cannot have content; such tags can be self-closing
    private static final String[] emptyTags = {
            "meta", "link", "base", "frame", "img", "br", "wbr", "embed", "hr", "input", "keygen", "col", "command",
            "device"
    };

    private static final String[] formatAsInlineTags = {
            "title", "a", "p", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "address", "th", "td", "script", "style",
            "ins", "del", "s"
    };

    // whitespace must be preserved inside these tags
    private static final String[] preserveWhitespaceTags = {"pre", "plaintext", "title", "textarea"};
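To make the categories concrete, here is a minimal sketch of how such lists could be used for lookups. The class TagKinds and its methods are hypothetical, not part of Jsoup, and the sets contain only a few entries copied from the lists above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Jsoup: classifies a tag name using
// abbreviated copies of the block and empty tag lists.
final class TagKinds {
    private static final Set<String> BLOCK = new HashSet<>(Arrays.asList(
            "html", "head", "body", "div", "p", "ul", "li", "table", "hr"));
    private static final Set<String> EMPTY = new HashSet<>(Arrays.asList(
            "meta", "link", "img", "br", "hr", "input"));

    // Block tags start on a new line; any name not listed is treated as inline here.
    static boolean isBlock(String tagName) {
        return BLOCK.contains(tagName.toLowerCase(Locale.ROOT));
    }

    // Empty tags cannot hold content, so they may be written self-closed.
    static boolean canSelfClose(String tagName) {
        return EMPTY.contains(tagName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"div", "span", "br", "hr"}) {
            System.out.println(name + ": block=" + isBlock(name)
                    + ", canSelfClose=" + canSelfClose(name));
        }
    }
}
```

Note that the two properties are independent: hr appears in both the block list and the empty list, while br is inline and empty.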

In addition, Jsoup's Entities class handles the escaping of HTML entities. The mapping data for these escapes is stored in entities-full.properties and entities-base.properties.
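As a rough illustration of what entity escaping means, the sketch below replaces four characters with named entities. MiniEscaper is a hypothetical class with a hard-coded table; it does not read the .properties files and is not how the Entities class is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Jsoup's Entities class: escapes four characters
// using a hard-coded table instead of the .properties files.
final class MiniEscaper {
    private static final Map<Character, String> NAMED = new HashMap<>();
    static {
        NAMED.put('&', "&amp;");
        NAMED.put('<', "&lt;");
        NAMED.put('>', "&gt;");
        NAMED.put('"', "&quot;");
    }

    // Copies the text, swapping each character that has a table entry for its entity.
    static String escape(String text) {
        StringBuilder accum = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String entity = NAMED.get(c);
            if (entity != null) accum.append(entity);
            else accum.append(c);
        }
        return accum.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & c")); // a &lt; b &amp; c
    }
}
```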

How Jsoup implements formatting

In Jsoup, calling Document.toString() (inherited from Element) outputs the document. In addition, OutputSettings controls the output format, mainly through prettyPrint (whether to reformat), outline (whether to force every tag onto its own line), indentAmount (the indentation width), and so on.

The inheritance and call relationships involved are slightly complex; roughly, they look like this:

Document.toString() = Document.outerHtml() = Element.html(), and Element.html() ultimately loops over all child nodes, calls outerHtml() on each, and concatenates the results as the output.

    private void html(StringBuilder accum) {
        for (Node node : childNodes)
            node.outerHtml(accum);
    }

outerHtml(), in turn, uses an OuterHtmlVisitor to traverse the node and its children and assemble the result.

    protected void outerHtml(StringBuilder accum) {
        new NodeTraversor(new OuterHtmlVisitor(accum, getOutputSettings())).traverse(this);
    }

OuterHtmlVisitor visits every child node and calls two methods on it, node.outerHtmlHead() and node.outerHtmlTail().

    private static class OuterHtmlVisitor implements NodeVisitor {
        private StringBuilder accum;
        private Document.OutputSettings out;

        public void head(Node node, int depth) {
            node.outerHtmlHead(accum, depth, out);
        }

        public void tail(Node node, int depth) {
            if (!node.nodeName().equals("#text")) // saves a void hit.
                node.outerHtmlTail(accum, depth, out);
        }
    }
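The head/tail visitor idea can be tried in isolation with a miniature tree. In the sketch below, MiniTraversal, Node and Visitor are hypothetical types, not Jsoup classes, and the walk uses plain recursion for brevity, which may differ from how NodeTraversor walks the tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the head/tail visitor pattern; none of these
// types are Jsoup classes.
final class MiniTraversal {
    interface Visitor {
        void head(Node node, int depth);
        void tail(Node node, int depth);
    }

    static final class Node {
        final String name;   // tag name, or "#text" for a text node
        final String text;   // only used by text nodes
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) { this.name = name; this.text = text; }

        Node add(Node child) { children.add(child); return this; }
    }

    // Depth-first walk: head() before the children, tail() after them.
    static void traverse(Visitor visitor, Node node, int depth) {
        visitor.head(node, depth);
        for (Node child : node.children) traverse(visitor, child, depth + 1);
        visitor.tail(node, depth);
    }

    static String outerHtml(Node root) {
        final StringBuilder accum = new StringBuilder();
        traverse(new Visitor() {
            public void head(Node node, int depth) {
                if (node.name.equals("#text")) accum.append(node.text);
                else accum.append('<').append(node.name).append('>');
            }
            public void tail(Node node, int depth) {
                if (!node.name.equals("#text")) // text nodes have no closing part
                    accum.append("</").append(node.name).append('>');
            }
        }, root, 0);
        return accum.toString();
    }

    public static void main(String[] args) {
        Node p = new Node("p", null)
                .add(new Node("#text", "Hello "))
                .add(new Node("b", null).add(new Node("#text", "world")));
        System.out.println(outerHtml(p)); // <p>Hello <b>world</b></p>
    }
}
```

Splitting the work into head and tail lets one visitor emit both the opening and closing parts of each node without the node needing to know about its children's output.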

We have finally reached the code that does the real work: node.outerHtmlHead() and node.outerHtmlTail(). Each node type in Jsoup produces its output differently; here we only discuss the two main node types, Element and TextNode. Element is the main target of formatting, and its two methods are as follows:

    void outerHtmlHead(StringBuilder accum, int depth, Document.OutputSettings out) {
        if (accum.length() > 0 && out.prettyPrint()
                && (tag.formatAsBlock() || (parent() != null && parent().tag().formatAsBlock()) || out.outline()))
            // wrap and adjust indentation
            indent(accum, depth, out);
        accum.append("<").append(tagName());
        attributes.html(accum, out);

        if (childNodes.isEmpty() && tag.isSelfClosing())
            accum.append("/>");
        else
            accum.append(">");
    }

    void outerHtmlTail(StringBuilder accum, int depth, Document.OutputSettings out) {
        if (!(childNodes.isEmpty() && tag.isSelfClosing())) {
            if (out.prettyPrint() && (!childNodes.isEmpty() && (
                    tag.formatAsBlock() || (out.outline() && (childNodes.size() > 1
                            || (childNodes.size() == 1 && !(childNodes.get(0) instanceof TextNode))))
            )))
                // wrap and adjust indentation
                indent(accum, depth, out);
            accum.append("</").append(tagName()).append(">");
        }
    }

The indent method has only one line of code:

    protected void indent(StringBuilder accum, int depth, Document.OutputSettings out) {
        // out.indentAmount() is the indentation width; the default is 1
        accum.append("\n").append(StringUtil.padding(depth * out.indentAmount()));
    }
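The interplay between outerHtmlHead(), outerHtmlTail() and indent() is easier to see on a toy tree. The following is a simplified sketch under stated assumptions: MiniPrettyPrinter and its Node type are hypothetical, the formatAsBlock flag is supplied by the caller, and attributes, outline mode, self-closing tags and text handling are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the newline-and-indent decisions; it omits
// attributes, outline mode, self-closing tags and text normalisation.
final class MiniPrettyPrinter {
    static final class Node {
        final String tag;            // null for a text node
        final String text;           // null for an element
        final boolean formatAsBlock; // assumed to be supplied by the caller
        Node parent;
        final List<Node> children = new ArrayList<>();

        private Node(String tag, String text, boolean formatAsBlock) {
            this.tag = tag;
            this.text = text;
            this.formatAsBlock = formatAsBlock;
        }

        static Node element(String tag, boolean formatAsBlock) {
            return new Node(tag, null, formatAsBlock);
        }

        static Node textNode(String text) {
            return new Node(null, text, false);
        }

        Node add(Node child) { child.parent = this; children.add(child); return this; }
    }

    // Same idea as indent(): a newline followed by depth * indentAmount spaces.
    static void indent(StringBuilder accum, int depth, int indentAmount) {
        accum.append('\n');
        for (int i = 0; i < depth * indentAmount; i++) accum.append(' ');
    }

    static void write(Node node, int depth, int indentAmount, StringBuilder accum) {
        if (node.tag == null) { accum.append(node.text); return; }
        // Head: break the line if this element or its parent formats as a block.
        boolean parentIsBlock = node.parent != null && node.parent.formatAsBlock;
        if (accum.length() > 0 && (node.formatAsBlock || parentIsBlock))
            indent(accum, depth, indentAmount);
        accum.append('<').append(node.tag).append('>');
        for (Node child : node.children) write(child, depth + 1, indentAmount, accum);
        // Tail: only a block element with children gets its closing tag on a new line.
        if (!node.children.isEmpty() && node.formatAsBlock)
            indent(accum, depth, indentAmount);
        accum.append("</").append(node.tag).append('>');
    }

    static String prettyPrint(Node root, int indentAmount) {
        StringBuilder accum = new StringBuilder();
        write(root, 0, indentAmount, accum);
        return accum.toString();
    }

    public static void main(String[] args) {
        Node div = Node.element("div", true)
                .add(Node.element("p", false).add(Node.textNode("Hi")))
                .add(Node.element("p", false).add(Node.textNode("there")));
        System.out.println(prettyPrint(div, 1));
    }
}
```

In this sketch each p starts on its own indented line because its parent formats as a block, while its closing tag stays on the same line because p itself does not.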

The code is simple and clear, so there is little to add. One thing worth mentioning: to reduce string allocation, StringUtil.padding() keeps commonly used indentation strings in an array.
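The caching idea can be sketched as follows. PaddingCache is a hypothetical class, and the cache size of 11 entries is an assumption chosen for illustration, not the value Jsoup uses.

```java
import java.util.Arrays;

// Hypothetical sketch of caching common indentation strings; the class name
// and cache size are assumptions, not Jsoup's actual values.
final class PaddingCache {
    private static final String[] CACHED = new String[11]; // widths 0..10

    static {
        StringBuilder accum = new StringBuilder();
        for (int width = 0; width < CACHED.length; width++) {
            CACHED[width] = accum.toString();
            accum.append(' ');
        }
    }

    // Returns a string of `width` spaces, reusing a cached instance when possible.
    static String padding(int width) {
        if (width < 0) throw new IllegalArgumentException("width must be >= 0");
        if (width < CACHED.length) return CACHED[width]; // no new allocation
        char[] spaces = new char[width];
        Arrays.fill(spaces, ' ');
        return new String(spaces);
    }

    public static void main(String[] args) {
        System.out.println("[" + padding(4) + "]");
        System.out.println(padding(4) == padding(4)); // true: same cached instance
    }
}
```

Since most documents nest only a few levels deep, the common widths are served from the array and only unusually deep indentation builds a new string.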

All right, this was a fairly light article; the next one covers the parser, which has more technical substance.

Also, from this section we learned to name a StringBuilder accum rather than sb.
