Jsoup source code walkthrough, part 4: the parser

Last Update:2016-05-06 Source: Internet

Author: User


As the best HTML parsing library in the Java world, Jsoup has a parser implementation that is quite representative. The parser is also the most complicated part of Jsoup and requires some knowledge of data structures, state machines, and even compilers. Fortunately, HTML syntax is not complex and parsing only goes as far as the DOM tree, so it makes a good introduction to compilers. This part should not be swallowed whole; let's brew a cup of coffee and savor it slowly.

Basic knowledge

Compiler

The process of translating one computer language into another (usually a lower-level one, such as machine code, assembly, or JVM bytecode) is called compilation. Compilers are an important field of computer science with many years of history. The recent proliferation of general-purpose languages, together with the rise of cross-language compilation and the popularity of the DSL concept, has made compilers fashionable again.

There are three well-known classics in the compiler field: the Dragon Book "Compilers: Principles, Techniques, and Tools", the Tiger Book "Modern Compiler Implementation in X" (X stands for one of several languages), and the Whale Book "Advanced Compiler Design and Implementation". The Dragon Book is the usual choice for compiler theory, while the latter two are more instructive for practice. In addition, the blogger "Assembly Head" has a good getting-started series on compilers: http://www.cnblogs.com/Ninputer/archive/2011/06/07/2074632.html

The basic flow of the compiler is as follows:

Lexical analysis, syntax analysis, and semantic analysis are called the compiler's front end. Everything from intermediate code generation through optimization to target code generation belongs to the compiler's back end. Front-end technology is already mature: tools such as Lex and YACC automate lexical and syntax analysis (ANTLR is a similar tool in the Java world). Back-end technology is more complex and is the focus of current compiler research.

Having said so much, let's go back to HTML. HTML is a declarative language whose final output is a rendered page in the browser, not an executable target program, so for HTML I replace "translate" in the flow above with "render".

In Jsoup (and in similar HTML parsers), there are only two steps, lex (lexical analysis) and parse (syntax analysis), and the final output of HTML parsing is the DOM tree. For the semantic analysis and rendering of HTML, see the article by the Ctrip UED team: How the Browser Works: Rendering Engine, HTML Parsing.

State machine

Both the lexical analysis and the syntax analysis in Jsoup use state machines. A state machine can be understood as a special programming model. Regular expressions, which we deal with all the time, are implemented with state machines.

A state machine consists of two parts: states and transitions. Depending on whether the next state is uniquely determined, state machines are divided into DFA (deterministic finite automaton) and NFA (nondeterministic finite automaton). Take the simple regular expression "a[b]*" as an example. We first map it to a DFA, which looks roughly like this:

The state machine itself is a programming model. If we try to implement it in code, the most straightforward way is probably this:

public void process(StringReader reader) throws StringReader.EOFException {
    char ch;
    switch (state) {
        case Init:
            ch = reader.read();
            if (ch == 'a') {
                state = State.AfterA;
                accum.append(ch);
            }
            break;
        case AfterA:
            ...
            break;
        case AfterB:
            ...
            break;
        case Accept:
            ...
            break;
    }
}

Writing a simple state machine this way is no problem, but it becomes uncomfortable in complicated cases. Another standard solution is to build a state transition table first and then drive the state machine from that table. The problem with this approach is that it can only express pure state transitions; input and output cannot be manipulated at the code level.
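The table-driven approach described above can be sketched as follows. This is only an illustration for "a[b]*", not code from Jsoup; the class name TableDrivenAb, the state numbering, and the extra Dead state are assumptions made for this example.

```java
// Hypothetical sketch: a table-driven DFA for the regular expression a[b]*.
final class TableDrivenAb {
    // States: 0 = Init, 1 = AfterA (accepting), 2 = Dead (reject sink).
    private static final int INIT = 0, AFTER_A = 1, DEAD = 2;

    // TRANSITIONS[state][inputClass]; input classes: 0 = 'a', 1 = 'b', 2 = anything else.
    private static final int[][] TRANSITIONS = {
        /* Init   */ {AFTER_A, DEAD, DEAD},
        /* AfterA */ {DEAD, AFTER_A, DEAD},
        /* Dead   */ {DEAD, DEAD, DEAD},
    };

    private static int inputClass(char ch) {
        if (ch == 'a') return 0;
        if (ch == 'b') return 1;
        return 2;
    }

    // Runs the whole input through the table and reports whether it ends in the accepting state.
    static boolean matches(String input) {
        int state = INIT;
        for (int i = 0; i < input.length(); i++) {
            state = TRANSITIONS[state][inputClass(input.charAt(i))];
        }
        return state == AFTER_A;
    }

    public static void main(String[] args) {
        System.out.println(matches("abbb"));
        System.out.println(matches("ba"));
    }
}
```

The table only says which state comes next. There is no natural place in it to accumulate characters or emit a token, which is the limitation mentioned above.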

Jsoup uses the State pattern to implement its state machines. The first time I saw it, it was a real eye-opener. The State pattern is a design pattern that binds each state to its corresponding behavior, which makes it well suited to implementing state transitions in a state machine.

The State pattern implementation of the "a[b]*" example is as follows. It uses an enumeration to implement the State pattern, in the same way Jsoup does:

public class StateModelABStateMachine implements ABStateMachine {

    State state;

    StringBuilder accum;

    enum State {
        Init {
            @Override
            public void process(StateModelABStateMachine stateModelABStateMachine, StringReader reader) throws StringReader.EOFException {
                char ch = reader.read();
                if (ch == 'a') {
                    stateModelABStateMachine.state = AfterA;
                    stateModelABStateMachine.accum.append(ch);
                }
            }
        },
        Accept {
            ...
        },
        AfterA {
            ...
        },
        AfterB {
            ...
        };

        public void process(StateModelABStateMachine stateModelABStateMachine, StringReader reader) throws StringReader.EOFException {
        }
    }

    public void process(StringReader reader) throws StringReader.EOFException {
        state.process(this, reader);
    }
}
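For readers who want something that compiles on its own, here is a self-contained sketch of the same enum-based State pattern for "a[b]*". It is not the code from the repository mentioned below: the class EnumAbMachine, the Reject state, and reading from a plain String instead of the author's StringReader are all assumptions for illustration.

```java
// Hypothetical sketch: the State pattern with an enum, matching a[b]*.
final class EnumAbMachine {
    enum State {
        Init {
            @Override
            State next(char ch, StringBuilder accum) {
                if (ch == 'a') {
                    accum.append(ch);   // each state can touch the output directly
                    return AfterA;
                }
                return Reject;
            }
        },
        AfterA {
            @Override
            State next(char ch, StringBuilder accum) {
                if (ch == 'b') {
                    accum.append(ch);
                    return AfterA;
                }
                return Reject;
            }
        },
        Reject {
            @Override
            State next(char ch, StringBuilder accum) {
                return Reject;          // sink state: once rejected, always rejected
            }
        };

        // Each constant supplies its own behavior for one input character.
        abstract State next(char ch, StringBuilder accum);
    }

    private State state = State.Init;
    private final StringBuilder accum = new StringBuilder();

    // Feeds the input to the current state one character at a time.
    boolean run(String input) {
        for (int i = 0; i < input.length(); i++) {
            state = state.next(input.charAt(i), accum);
        }
        return state == State.AfterA;
    }

    String matched() {
        return accum.toString();
    }

    public static void main(String[] args) {
        EnumAbMachine machine = new EnumAbMachine();
        System.out.println(machine.run("abb") + " " + machine.matched());
    }
}
```

Compared with a transition table, the behavior for each state lives next to the state itself, and the code is free to accumulate or emit output while it decides the next state.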

PS: I forked the jsoup code on GitHub, committed this series of articles to it, and added Chinese comments to some of the code. If you are interested, see https://github.com/code4craft/jsoup-learning. The complete implementations of the state machines mentioned in this article are under https://github.com/code4craft/jsoup-learning/tree/master/src/main/java/us/codecraft/learning in that repository.

Code structure

First, here are the main classes in the parser package:

Parser
The entry facade of the Jsoup parser, which wraps the commonly used static parse methods. maxErrors can be set to collect error records; the default is 0, which means errors are not collected. The related classes are ParseError and ParseErrorList. Based on this feature, I wrote PageErrorChecker, which syntax-checks a page and outputs its syntax errors.
Token
Holds a single result of lexical analysis. Token is an abstract class with six implementations, Doctype, StartTag, EndTag, Comment, Character, and EOF, corresponding to six lexical types.
Tokeniser
Holds the state and results of the lexical analysis process. The two most important fields are state and emitPending; the former holds the current state and the latter holds the output. There are also tagPending / doctypePending / commentPending, which hold tokens that have not been completely filled in yet.
CharacterReader
A wrapper around the character-reading logic, used to feed characters to the tokeniser. CharacterReader offers operations similar to NIO's ByteBuffer, such as consume(), unconsume(), mark(), and rewindToMark(), as well as higher-level ones such as consumeTo().
TokeniserState
The lexical analysis state machine, implemented with an enumeration.
HtmlTreeBuilder
The syntax analysis class, which builds the DOM tree from tokens.
HtmlTreeBuilderState
The syntax analysis state machine.
TokenQueue
Although it has "Token" in its name, it is actually used for queries. It will be covered in the select part.
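To make the cursor-style operations listed for CharacterReader more concrete, here is a minimal sketch of such a reader. It is not Jsoup's CharacterReader; the class MiniReader and its exact behavior (for example, returning the rest of the input when the target character is not found) are assumptions for illustration.

```java
// Hypothetical sketch of a cursor-style character reader; not Jsoup's CharacterReader.
final class MiniReader {
    static final char EOF = (char) -1;

    private final String input;
    private int pos = 0;    // index of the next character to read
    private int mark = 0;   // position remembered by mark()

    MiniReader(String input) {
        this.input = input;
    }

    boolean isEmpty() {
        return pos >= input.length();
    }

    // Looks at the next character without consuming it.
    char current() {
        return isEmpty() ? EOF : input.charAt(pos);
    }

    // Returns the next character and moves the cursor forward.
    char consume() {
        char ch = current();
        pos++;
        return ch;
    }

    // Steps the cursor back by one character.
    void unconsume() {
        pos--;
    }

    void mark() {
        mark = pos;
    }

    void rewindToMark() {
        pos = mark;
    }

    // Consumes up to (not including) the next occurrence of ch, or to the end of input.
    String consumeTo(char ch) {
        if (isEmpty()) return "";
        int idx = input.indexOf(ch, pos);
        int end = idx == -1 ? input.length() : idx;
        String consumed = input.substring(pos, end);
        pos = end;
        return consumed;
    }

    public static void main(String[] args) {
        MiniReader reader = new MiniReader("<div>test</div>");
        reader.consume();                          // skip '<'
        System.out.println(reader.consumeTo('>')); // the tag name
    }
}
```

The mark/rewind pair is what lets a tokeniser try a longer match and back out when it fails, without copying the input.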

Lexical Analysis State machine

Now let's talk about the lexical analysis of HTML. Here is a picture from http://ued.ctrip.com/blog/?p=3295 that describes the state transitions for a tag:

It ignores HTML comments, entities, and attributes, and keeps only basic start and end tags, such as the following HTML:

<div>test</div>

Jsoup's lexical analysis is more complex. I extracted the corresponding part from it to make our MiniSoupLexer (part of the code is omitted here; the complete code is in MiniSoupTokeniserState):

enum MiniSoupTokeniserState implements ITokeniserState {
    /**
     * The state when not inside any tag.
     * <div>test</div>
     */
    Data {
        // in data state, gather characters until a character reference or tag is found
        public void read(Tokeniser t, CharacterReader r) {
            switch (r.current()) {
                case '<':
                    t.advanceTransition(TagOpen);
                    break;
                case eof:
                    t.emit(new Token.EOF());
                    break;
                default:
                    String data = r.consumeToAny('&', '<', nullChar);
                    t.emit(data);
                    break;
            }
        }
    },
    /**
     * <div>test</div>
     */
    TagOpen {
        ...
    },
    /**
     * <div>test</div>
     */
    EndTagOpen {
        ...
    },
    /**
     * <div>test</div>
     */
    TagName {
        ...
    };
}

From this code we can see the general idea behind Jsoup's lexical analysis. Writing the full parser is tedious work: it involves attribute values (distinguishing single and double quotes), DOCTYPE, comments, HTML entities, and a number of error conditions. But once the idea is understood, the implementation is a matter of going step by step.

Life has been a bit busy lately. My daughter stays awake at night, and I am not in great shape mentally. At work I have many ideas but few get recognized, and some things cannot be settled just by writing good code. Never mind, I should keep a good attitude; after all I am still junior, so I will just carry on with my own work.

I am not reading the Jsoup source code out of boredom; the real purpose is to make WebMagic a little better, since the parser is also an important part of a crawler. After reading the code I gained a lot, and my understanding of HTML has deepened too.
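The four states above (Data, TagOpen, EndTagOpen, TagName) are enough to tokenise the `<div>test</div>` example. The following is a self-contained sketch of that idea, not the MiniSoupTokeniserState code: the class MiniLexer, the string form of the tokens, and the pendingIsEnd flag are assumptions for illustration, and attributes, comments, entities, and error handling are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a four-state lexer for plain start/end tags and text.
final class MiniLexer {
    enum State {
        Data {
            @Override
            void read(MiniLexer t) {
                if (t.atEnd()) {
                    t.tokens.add("EOF");
                    t.done = true;
                } else if (t.current() == '<') {
                    t.pos++;                      // consume '<'
                    t.state = TagOpen;
                } else {
                    // gather text up to the next '<' (or the end of input)
                    int end = t.input.indexOf('<', t.pos);
                    if (end == -1) end = t.input.length();
                    t.tokens.add("Character:" + t.input.substring(t.pos, end));
                    t.pos = end;
                }
            }
        },
        TagOpen {
            @Override
            void read(MiniLexer t) {
                if (!t.atEnd() && t.current() == '/') {
                    t.pos++;                      // consume '/'
                    t.state = EndTagOpen;
                } else {
                    t.pendingIsEnd = false;       // start tag pending
                    t.state = TagName;
                }
            }
        },
        EndTagOpen {
            @Override
            void read(MiniLexer t) {
                t.pendingIsEnd = true;            // end tag pending
                t.state = TagName;
            }
        },
        TagName {
            @Override
            void read(MiniLexer t) {
                int end = t.input.indexOf('>', t.pos);
                if (end == -1) end = t.input.length();
                String name = t.input.substring(t.pos, end);
                t.tokens.add((t.pendingIsEnd ? "EndTag:" : "StartTag:") + name);
                t.pos = Math.min(end + 1, t.input.length());   // skip '>'
                t.state = Data;
            }
        };

        abstract void read(MiniLexer t);
    }

    private final String input;
    private int pos = 0;
    private State state = State.Data;
    private boolean pendingIsEnd = false;
    private boolean done = false;
    private final List<String> tokens = new ArrayList<>();

    private MiniLexer(String input) {
        this.input = input;
    }

    private boolean atEnd() {
        return pos >= input.length();
    }

    private char current() {
        return input.charAt(pos);
    }

    // Keeps asking the current state to read until the EOF token has been emitted.
    static List<String> tokenize(String html) {
        MiniLexer lexer = new MiniLexer(html);
        while (!lexer.done) {
            lexer.state.read(lexer);
        }
        return lexer.tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<div>test</div>"));
    }
}
```

The driver loop never inspects the input itself; it only delegates to whichever state is current, which is the same division of labor as between Tokeniser and TokeniserState.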

DOM Tree generation process

Calling the TreeBuilder part "syntax analysis" may be slightly inaccurate, since it is really the process of generating a DOM tree from tokens, but I will keep using the compiler terminology here.

TreeBuilder is likewise a facade object. The real parsing is done by the following piece of code:

protected void runParser() {
    while (true) {
        Token token = tokeniser.read();
        process(token);
        if (token.type == Token.TokenType.EOF)
            break;
    }
}

TreeBuilder has two subclasses, HtmlTreeBuilder and XmlTreeBuilder. XmlTreeBuilder is naturally the class that builds an XML tree. Its implementation is quite simple: it basically maintains a stack and inserts nodes according to the type of each token:

@Override
protected boolean process(Token token) {
    // start tag, end tag, doctype, comment, character, eof
    switch (token.type) {
        case StartTag:
            insert(token.asStartTag());
            break;
        case EndTag:
            popStackToClose(token.asEndTag());
            break;
        case Comment:
            insert(token.asComment());
            break;
        case Character:
            insert(token.asCharacter());
            break;
        case Doctype:
            insert(token.asDoctype());
            break;
        case EOF: // could put some normalisation here if desired
            break;
        default:
            Validate.fail("Unexpected token type: " + token.type);
    }
    return true;
}

The insertNode code looks roughly like this (some methods have been merged for ease of presentation):

Element insert(Token.StartTag startTag) {
    Tag tag = Tag.valueOf(startTag.name());
    Element el = new Element(tag, baseUri, startTag.attributes);
    // append the new element to the element on top of the stack
    stack.getLast().appendChild(el);
    if (startTag.isSelfClosing()) {
        tokeniser.acknowledgeSelfClosingFlag();
        if (!tag.isKnownTag()) // unknown tag, remember this is self closing for output. see above.
            tag.setSelfClosing();
    } else {
        stack.add(el);
    }
    return el;
}
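The stack-based construction described above can be sketched in a self-contained form. This is not XmlTreeBuilder itself: the class MiniTreeBuilder, its Node type, and the representation of tokens as plain strings ("<name>", "</name>", or text) are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of a stack-based tree builder; names are illustrative, not Jsoup's.
final class MiniTreeBuilder {
    static final class Node {
        final String name;                        // tag name, or "#text" for text nodes
        final String text;                        // text content for "#text" nodes, else null
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) {
            this.name = name;
            this.text = text;
        }

        @Override
        public String toString() {
            if (text != null) return text;
            StringBuilder sb = new StringBuilder("<" + name + ">");
            for (Node child : children) sb.append(child);
            return sb.append("</").append(name).append(">").toString();
        }
    }

    static Node build(List<String> tokens) {
        Node root = new Node("#root", null);
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        for (String token : tokens) {
            if (token.startsWith("</")) {
                popStackToClose(stack, token.substring(2, token.length() - 1));
            } else if (token.startsWith("<")) {
                Node el = new Node(token.substring(1, token.length() - 1), null);
                stack.peek().children.add(el);    // attach to the current element
                stack.push(el);                   // it becomes the new current element
            } else {
                stack.peek().children.add(new Node("#text", token));
            }
        }
        return root;
    }

    // Pops elements until the one matching the end tag has been removed.
    // An end tag with no matching open element is ignored.
    private static void popStackToClose(Deque<Node> stack, String name) {
        if (name.startsWith("#")) return;         // never pop the synthetic root
        boolean open = false;
        for (Node n : stack) {
            if (n.name.equals(name)) {
                open = true;
                break;
            }
        }
        if (!open) return;
        while (!stack.peek().name.equals(name)) stack.pop();
        stack.pop();
    }

    public static void main(String[] args) {
        Node root = build(Arrays.asList("<div>", "<p>", "hi", "</div>"));
        System.out.println(root.children.get(0));
    }
}
```

Note that closing an outer element also closes everything still open inside it, because those elements sit above it on the stack.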

HTML Parsing state machine

Compared with XmlTreeBuilder, the implementation of HtmlTreeBuilder is more complex. Besides a similar stack structure, it also uses HtmlTreeBuilderState to build a state machine for analyzing HTML. What is this for? Let's first look at which states HtmlTreeBuilderState uses (the states are marked with <!-- State: --> comments in the HTML below):

<!-- State: Initial -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- State: BeforeHtml -->
<html lang='zh-CN' xml:lang='zh-CN' xmlns='http://www.w3.org/1999/xhtml'>
<!-- State: BeforeHead -->
<head>
    <!-- State: InHead -->
    <script type="text/javascript">
    //<!-- State: Text -->
    function xx(){
    }
    </script>
    <noscript>
        <!-- State: InHeadNoscript -->
        Your browser does not support javascript!
    </noscript>
</head>
<!-- State: AfterHead -->
<body>
<!-- State: InBody -->
<textarea>
    <!-- State: Text -->
    xxx
</textarea>
<table>
    <!-- State: InTable -->
    <!-- State: InTableText -->
    xxx
    <tbody>
    <!-- State: InTableBody -->
    </tbody>
    <tr>
        <!-- State: InRow -->
        <td>
            <!-- State: InCell -->
        </td>
    </tr>
</table>
</html>

As you can see, HTML tags have nesting requirements; for example, <tr> and <td> need to be used together with <table>. From the Jsoup code, you can see that HtmlTreeBuilderState does the following things:

Syntax check

For example, a tr that is not nested inside a table tag is a syntax error. An error occurs when one of the following tags appears directly in the InBody state. When Jsoup encounters this error, it gives up on that token, logs the error, and then continues parsing the rest of the content instead of exiting.

InBody {
    boolean process(Token t, HtmlTreeBuilder tb) {
        if (StringUtil.in(name,
                "caption", "col", "colgroup", "frame", "head", "tbody", "td", "tfoot", "th", "thead", "tr")) {
            tb.error(this);
            return false;
        }
    }
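The "record the error, drop the token, keep going" behavior can be sketched as follows. This is an illustration only: the class NestingCheckDemo and its method are hypothetical, and only the list of tag names is taken from the excerpt above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "record the error, drop the token, keep going".
final class NestingCheckDemo {
    // Start tags that the excerpt above rejects in the InBody state.
    private static final Set<String> NOT_ALLOWED_IN_BODY = new HashSet<>(Arrays.asList(
            "caption", "col", "colgroup", "frame", "head", "tbody",
            "td", "tfoot", "th", "thead", "tr"));

    final List<String> errors = new ArrayList<>();
    final List<String> accepted = new ArrayList<>();

    // Returns false when the tag is rejected, mirroring the boolean result of process().
    boolean processStartTagInBody(String name) {
        if (NOT_ALLOWED_IN_BODY.contains(name)) {
            errors.add("Unexpected start tag <" + name + "> in InBody");
            return false;
        }
        accepted.add(name);
        return true;
    }

    public static void main(String[] args) {
        NestingCheckDemo demo = new NestingCheckDemo();
        for (String name : Arrays.asList("div", "tr", "p")) {
            demo.processStartTagInBody(name);   // a rejected tag does not stop the loop
        }
        System.out.println(demo.accepted);
        System.out.println(demo.errors);
    }
}
```

The caller keeps feeding tokens regardless of the return value, so one misplaced tag costs one error record rather than the whole parse.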

Tag completion

For example, if the head tag is not closed and tags that are only allowed inside body follow it, head is closed automatically. Some of the anythingElse() methods in HtmlTreeBuilderState provide this automatic tag completion. For example, the auto-closing code for the InHead state is as follows:

private boolean anythingElse(Token t, TreeBuilder tb) {
    tb.process(new Token.EndTag("head"));
    return tb.process(t);
}

Tags are also closed in another way, as in the following code:

private void closeCell(HtmlTreeBuilder tb) {
    if (tb.inTableScope("td"))
        tb.process(new Token.EndTag("td"));
    else
        tb.process(new Token.EndTag("th")); // only here if th or td in scope
}
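The trick in both snippets is the same: feed a synthetic end tag to the builder, then process the original token again in whatever state results. Here is a self-contained sketch of that idea. It is not Jsoup code: the class AutoCloseDemo is hypothetical, tokens are plain strings, and the real AfterHead state is skipped so that head goes straight to InBody.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "insert a synthetic end tag, then reprocess the token".
final class AutoCloseDemo {
    enum State {
        InHead {
            @Override
            boolean process(String token, AutoCloseDemo tb) {
                if (token.equals("</head>")) {
                    tb.log.add("close head");
                    tb.state = InBody;            // simplified: the real AfterHead state is skipped
                    return true;
                }
                if (token.equals("<title>") || token.equals("</title>")) {
                    tb.log.add("head content " + token);
                    return true;
                }
                // Anything else: act as if </head> had been seen, then hand
                // the same token to whichever state is now current.
                tb.process("</head>");
                return tb.process(token);
            }
        },
        InBody {
            @Override
            boolean process(String token, AutoCloseDemo tb) {
                tb.log.add("body content " + token);
                return true;
            }
        };

        abstract boolean process(String token, AutoCloseDemo tb);
    }

    private State state = State.InHead;
    final List<String> log = new ArrayList<>();

    boolean process(String token) {
        return state.process(token, this);
    }

    public static void main(String[] args) {
        AutoCloseDemo tb = new AutoCloseDemo();
        tb.process("<title>");
        tb.process("<p>");      // not allowed in head, so head is closed first
        System.out.println(tb.log);
    }
}
```

Because the synthetic end tag goes through the normal process() path, the closing logic lives in exactly one place and the fallback only has to trigger it.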

Case study: what happens when a tag is missing?

Having read so much parser source code, let's come back to everyday use. We know that leaving one or two tags unclosed on a page is quite common, so how are they parsed?

Take the <div> tag as an example:

The start tag was omitted and only the end tag was written

case EndTag:
    if (StringUtil.in(name, "div", "dl", "fieldset", "figcaption", "figure", "footer", "header", "pre", "section", "summary", "ul")) {
        if (!tb.inScope(name)) {
            tb.error(this);
            return false;
        }

Congratulations: this </div> is handled as an error and dropped, so your page will undoubtedly end up messed up! Of course, if you only write an extra </div>, it seems there is no effect at all. (I remember someone telling me that, to guard against unclosed tags, he would write a few extra </div> at the bottom of the page.)

The start tag was written and the end tag was omitted

This case is a little more complicated to analyze. If the tag cannot contain nested content, it is closed as soon as a tag it cannot accept is encountered. The <div> tag can contain most tags, so in this case its scope lasts until the end of the HTML.
 
That concludes the analysis of the parser. Along the way I learned a lot about HTML and state machines, though it is still some distance from practical use. The next part starts on the select section, which may be more meaningful for everyday use.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
