How browsers work: parsing and DOM tree construction

Last Update:2018-12-04 Source: Internet

Author: User

Tags format definition tagname


III. Parsing and DOM tree construction

1. Parsing

Because parsing is a very important process within the rendering engine, we will go into it step by step, starting with a short introduction to parsing in general. Parsing a document means converting it into a meaningful structure that code can understand and use. The result of parsing is usually a tree of nodes that represents the structure of the document. It is called a parse tree or a syntax tree. Example: parsing the expression "2 + 3 - 1" could return the tree shown in Figure 3.1.

1) Grammar: Parsing is based on the syntax rules the document obeys, that is, the language or format it was written in. Every format that can be parsed must have a deterministic grammar consisting of a vocabulary and syntax rules. Such a grammar is called a context-free grammar. Human languages are not of this kind and therefore cannot be parsed with conventional parsing techniques.

2) Parser and lexer combination: Parsing can be separated into two sub-processes, lexical analysis and syntax analysis. Lexical analysis breaks the input into tokens. Tokens are the vocabulary of the language: the collection of its valid building blocks. Syntax analysis applies the syntax rules of the language. Parsers usually divide the work between two components: the lexer (sometimes called the tokenizer), which breaks the input into valid tokens, and the parser, which analyzes the document structure according to the syntax rules and constructs the parse tree. The lexer knows how to strip irrelevant characters such as whitespace and line breaks. See Figure 3.1.2.
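The lexer stage described above can be sketched in a few lines of Python. This is only an illustration, assuming a tiny vocabulary of integers, plus signs, and minus signs (the vocabulary the article uses for its "2 + 3 - 1" example). The names `TOKEN_RE` and `tokenize` are hypothetical, not taken from any browser code base.

```python
import re

# One named group per token type; SKIP covers the whitespace the lexer drops.
TOKEN_RE = re.compile(
    r"(?P<INTEGER>0|[1-9][0-9]*)|(?P<PLUS>\+)|(?P<MINUS>-)|(?P<SKIP>\s+)"
)


def tokenize(text):
    """Break the input into (type, text) tokens, discarding whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For the input "2 + 3-1" this returns five tokens: INTEGER 2, PLUS, INTEGER 3, MINUS, INTEGER 1. A character outside the vocabulary, such as "*", raises a ValueError instead of being passed on to the parser.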

The parsing process is iterative. The parser asks the lexer for a new token and tries to match it against one of the syntax rules. If a rule matches, a node corresponding to the token is added to the parse tree and the parser asks for the next token. If no rule matches, the parser stores the token internally and keeps asking for tokens until a rule matching all the internally stored tokens is found. If no such rule is found, the parser raises an exception: the document is invalid and contains syntax errors.

3) Translation: In many cases the parse tree is not the final product. Parsing is often used in translation, that is, transforming the input document into another format. Compilation is an example: a compiler that compiles source code into machine code first parses it into a parse tree and then translates the tree into a machine code document. See Figure 3.1.3.
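As a rough illustration of the translation step, the sketch below turns a parse tree into instructions for an imaginary stack machine and then runs them. The tree layout (nested tuples of the form `(operator, left, right)`), the instruction names `PUSH`, `ADD`, and `SUB`, and the functions `to_stack_code` and `run` are all assumptions made for this example. Real compilers use far richer intermediate forms.

```python
def to_stack_code(tree):
    """Translate a parse tree into a flat list of stack-machine instructions."""
    if isinstance(tree, int):
        return [f"PUSH {tree}"]
    op, left, right = tree
    opcode = {"+": "ADD", "-": "SUB"}[op]
    # Post-order walk: emit both operands first, then the operator.
    return to_stack_code(left) + to_stack_code(right) + [opcode]


def run(instructions):
    """Execute the instructions on a simple operand stack."""
    stack = []
    for instruction in instructions:
        if instruction.startswith("PUSH "):
            stack.append(int(instruction[5:]))
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if instruction == "ADD" else left - right)
    return stack.pop()


# Parse tree for "2 + 3 - 1", written out by hand.
TREE = ("-", ("+", 2, 3), 1)
CODE = to_stack_code(TREE)
```

Here `CODE` is `["PUSH 2", "PUSH 3", "ADD", "PUSH 1", "SUB"]` and `run(CODE)` evaluates to 4. The tree is an intermediate product that is walked to produce a different representation.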

4) Parsing example: In Figure 3.1 we built a parse tree from a mathematical expression. Let's define a simple mathematical language and see how the parsing works.

Vocabulary: our language can include integers, plus signs, and minus signs.

Syntax:
1>. The building blocks of the syntax are expressions, terms, and operations.
2>. Our language can include any number of expressions.
3>. An expression is defined as a term followed by an operation followed by another term.
4>. An operation is a plus sign or a minus sign.
5>. A term is an integer or an expression.

Now let's analyze the input "2 + 3 - 1". The first substring that matches a rule is "2": according to rule #5 it is a term. The second match is "2 + 3", which matches rule #3: a term followed by an operation followed by another term. The next match is found only at the end of the input: "2 + 3 - 1" is an expression because we already know that "2 + 3" is a term, so again we have a term followed by an operation followed by another term. "2 + +" does not match any rule and is therefore invalid input.

5) Formal definitions of vocabulary and syntax: Vocabulary is usually expressed with regular expressions. For example, our language can be defined as:

INTEGER: 0|[1-9][0-9]*
PLUS: +
MINUS: -

As you can see, integers are defined by a regular expression. Syntax is usually defined in a format called BNF. Our language is defined as:

expression := term operation term
operation := PLUS | MINUS
term := INTEGER | expression

We said that a language can be parsed by regular parsers if its grammar is context free. An intuitive definition of a context-free grammar is a grammar that can be entirely expressed in BNF. For a formal definition see http://en.wikipedia.org/wiki/Context-free_grammar

6) Types of parsers: There are two basic types of parsers, top-down parsers and bottom-up parsers. Intuitively, top-down parsers examine the high-level structure of the syntax and try to match one of its rules. Bottom-up parsers start with the input and gradually transform it into the syntax rules, starting from the low-level rules until the high-level rules are met.

Let's see how the two types of parsers would parse our example. The top-down parser starts from the higher-level rule: it identifies "2 + 3" as an expression and then identifies "2 + 3 - 1" as an expression (the process of identifying the expression evolves, matching the other rules, but the starting point is the highest-level rule). The bottom-up parser scans the input until a rule is matched and then replaces the matching input with the rule. This continues until the end of the input. The partly matched expression is placed on the parser's stack. Example: 3.1.6
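The two strategies can be compared with a small Python sketch for the example language. Both functions take a list of token strings and return a nested tuple `(operator, left, right)`. The function names and the tree layout are hypothetical. One assumption needs stating: the grammar as written is left-recursive (a term may be an expression), which a naive top-down parser cannot follow directly, so the top-down version uses the equivalent loop form "INTEGER followed by any number of (operation INTEGER) pairs".

```python
OPERATORS = ("+", "-")


def parse_top_down(tokens):
    """Start from the expression rule and consume tokens that fit it."""
    pos = 0

    def integer():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"integer expected at token {pos}")
        value = int(tokens[pos])
        pos += 1
        return value

    tree = integer()
    while pos < len(tokens):
        op = tokens[pos]
        if op not in OPERATORS:
            raise SyntaxError(f"operator expected at token {pos}")
        pos += 1
        tree = (op, tree, integer())
    return tree


def parse_bottom_up(tokens):
    """Shift tokens onto a stack and reduce 'term operation term' when seen."""
    stack = []
    for token in tokens:
        stack.append(int(token) if token.isdigit() else token)  # shift
        if (len(stack) >= 3 and stack[-2] in OPERATORS
                and stack[-1] not in OPERATORS and stack[-3] not in OPERATORS):
            right, op, left = stack.pop(), stack.pop(), stack.pop()
            stack.append((op, left, right))  # reduce to an expression
    if len(stack) != 1 or stack[0] in OPERATORS:
        raise SyntaxError("input does not reduce to a single expression")
    return stack[0]
```

For the tokens `["2", "+", "3", "-", "1"]` both functions return `("-", ("+", 2, 3), 1)`, and both raise SyntaxError for `["2", "+", "+"]`. In the bottom-up version the stack holds `[2, "+", 3]` just before the first reduction, which is the "partly matched expression" the text refers to.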

7) Generating parsers automatically: There are tools that can generate a parser for you. They are called parser generators. You feed them the grammar of your language, its vocabulary and syntax rules, and they generate a working parser. Creating a parser requires a deep understanding of parsing, and it is not easy to write an optimized parser by hand, so parser generators can be very useful. WebKit uses two well-known parser generators: Flex for creating a lexer and Bison for creating a parser (you may run into them under the names Lex and Yacc). Flex input is a file containing regular expression definitions of the tokens. Bison input is the language syntax rules in BNF format.

2. HTML parser

The job of the HTML parser is to parse the HTML markup into a parse tree.

1) HTML grammar definition: The vocabulary and syntax of HTML are defined in specifications created by the W3C organization. The current version is HTML4, and work on HTML5 is in progress.

2) Not a context-free grammar: As we saw in the introduction to parsing, grammar syntax can be defined formally using formats like BNF. Unfortunately, none of the conventional parser topics apply to HTML (they were not brought up just for fun: they are used in parsing CSS and JavaScript). HTML cannot easily be defined by the context-free grammar that parsers need. There is a formal format for defining HTML, the DTD (Document Type Definition), but it is not a context-free grammar.

HTML is rather close to XML, and there are many XML parsers available. There is even an XML variation of HTML, XHTML. So what is the big difference? The difference is that the HTML approach is more "forgiving": it lets you omit certain start or end tags. On the whole it is a "soft" syntax, as opposed to XML's stiff and demanding syntax. This seemingly small detail makes a world of difference. On the one hand, this is the main reason HTML is so popular: it forgives your mistakes and makes life easy for web authors. On the other hand, it makes it difficult to write a formal grammar. To summarize, HTML cannot be parsed easily, and conventional parsers are not an option because its grammar is not context free.

3) HTML DTD: The HTML definition is in DTD format. This format is used to define languages of the SGML family. It contains definitions for all allowed elements, their attributes, and their hierarchy. As we saw earlier, the HTML DTD does not form a context-free grammar. There are a few variations of the DTD. The strict mode conforms solely to the specification, but other modes contain support for markup used by browsers in the past, for backward compatibility with older content. The current strict DTD is here: http://www.w3.org/TR/html4/strict.dtd

4) DOM: The output tree of the parser is a tree of DOM element and attribute nodes. DOM is short for Document Object Model. It is the object representation of the HTML document and the interface between HTML elements and the outside world, such as JavaScript. The DOM has an almost one-to-one relation to the markup. For example, take this markup:

<html> <body> <p> Hello World </p> <div> </div> </body> </html>

Like HTML, the DOM is specified by the W3C organization. See http://www.w3.org/DOM/DOMTR. It is a generic specification for manipulating documents. A dedicated module describes the HTML-specific elements. Browsers build the tree from concrete implementations of these DOM interfaces, and those implementations carry additional attributes that the browser uses internally.
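The near one-to-one relation between markup and DOM nodes can be seen with Python's standard `xml.dom.minidom` module, which implements the W3C DOM interfaces. This is only a sketch: minidom is an XML parser with none of a browser's error tolerance, so the markup below is a well-formed version of the example. The `outline` helper is hypothetical and exists only to print the tree.

```python
from xml.dom import minidom

MARKUP = "<html><body><p>Hello World</p><div></div></body></html>"


def outline(node, depth=0):
    """Return one indented line per element or text node in the tree."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE:
        lines.append("  " * depth + repr(node.data))
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines


document = minidom.parseString(MARKUP)
tree_lines = outline(document.documentElement)
print("\n".join(tree_lines))
```

The printed outline has `html` at the root, `body` under it, and `p` (holding the text node 'Hello World') and `div` as siblings under `body`, which mirrors the nesting of the tags.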

5) The parsing algorithm: As we saw in the previous sections, HTML cannot be parsed using regular top-down or bottom-up parsers. The reasons are:
a> The forgiving nature of the language.
b> Browsers have traditional error tolerance to support well-known cases of invalid HTML.
c> The parsing process is reentrant. For other languages the source does not change during parsing, but in HTML a script containing document.write can add extra tokens, so the parsing process actually modifies the input.

Unable to use the regular parsing techniques, browsers create custom parsers for parsing HTML. The parsing algorithm is described in detail by the HTML5 specification. The algorithm consists of two stages: tokenization and tree construction. Tokenization is the lexical analysis, parsing the input into tokens. HTML tokens include start tags, end tags, attribute names, and attribute values. The tokenizer recognizes a token, hands it to the tree constructor, and consumes the next character to recognize the next token, and so on until the end of the input. See Figure 3.2.5 (HTML parsing flow, taken from the HTML5 specification).

6) The tokenization algorithm: The output of the algorithm is an HTML token. The algorithm is expressed as a state machine. Each state consumes one or more characters of the input stream and updates the next state according to those characters. The decision is influenced by the current tokenization state and by the tree construction state. This means that the same consumed character can yield different results for the correct next state, depending on the current state. The algorithm is too complex to describe in full, so let's look at a simple example that helps us understand the principle. Basic example: tokenizing the following HTML:

<html> <body> Hello world </body> </html>

7) The tree construction algorithm: When the parser is created, the Document object is created as well. During the tree construction stage, the DOM tree with the Document at its root is modified and elements are added to it. Each node emitted by the tokenizer is processed by the tree constructor. For each token, the specification defines which DOM element is relevant to it and will be created for it. Besides being added to the DOM tree, the element is also added to a stack of open elements. This stack is used to correct nesting mismatches and unclosed tags. The algorithm is also described as a state machine, whose states are called "insertion modes". Let's look at the tree construction process for the example input:

<html> <body> Hello world </body> </html>
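The role of the stack of open elements can be sketched in Python. This is a minimal illustration under strong assumptions: it has no insertion modes, creates no implied html, head, or body elements, and takes tokens as plain `(kind, value)` tuples. The `Element` class and the `build_tree` and `outline` functions are hypothetical names for this example only.

```python
class Element:
    def __init__(self, name):
        self.name = name
        self.children = []  # child Elements or text strings


def build_tree(tokens):
    """Build a tree from tokens, using a stack of open elements."""
    document = Element("#document")
    open_elements = [document]
    for kind, value in tokens:
        if kind == "StartTag":
            element = Element(value)
            open_elements[-1].children.append(element)
            open_elements.append(element)
        elif kind == "Character":
            open_elements[-1].children.append(value)
        elif kind == "EndTag":
            open_names = [element.name for element in open_elements[1:]]
            if value in open_names:
                # Implicitly close any unclosed elements above the match.
                while open_elements[-1].name != value:
                    open_elements.pop()
                open_elements.pop()
            # A stray end tag with no open match is ignored.
    return document


def outline(node, depth=0):
    """Return one indented line per node, for inspection."""
    if isinstance(node, str):
        return ["  " * depth + repr(node)]
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Feeding it the tokens of `<div><p>text</div>after` yields a `div` that contains `p`, with "after" placed outside the `div`: the end tag for `div` pops the unclosed `p` off the stack first. This shows how the stack lets the tree constructor repair unclosed tags.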

8) Actions when the parsing is finished: At this stage the browser marks the document as interactive and starts parsing scripts that are in "deferred" mode, that is, scripts that should be executed after the document has been parsed. After the scripts are processed, the document state is set to "complete" and a "load" event is fired. The HTML5 specification contains the complete algorithm: http://www.w3.org/TR/html5/syntax.html#html-parser

9) Browsers' error tolerance: You never get an "Invalid Syntax" error on an HTML page. Browsers fix the invalid content and go on. Take this HTML for example:

<html> <mytag> </mytag> <div> <p> </div> Really lousy HTML </p> </html>

Here "mytag" is not a standard tag and the "p" and "div" elements are wrongly nested, yet the browser still displays the page without complaint.
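Tolerance of this kind is not unique to browsers. Python's standard `html.parser.HTMLParser` also accepts unknown and badly nested tags without raising an error. The sketch below feeds it a compact version of the markup above and records the events it reports. The `EventLog` class is a hypothetical helper. Unlike a browser, this parser only reports what it sees and does not repair the nesting.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Record every start tag, end tag, and text run the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


LOUSY = "<html><mytag></mytag><div><p></div>Really lousy HTML</p></html>"

log = EventLog()
log.feed(LOUSY)   # no exception, despite the unknown tag and bad nesting
log.close()
```

After the run, `log.events` lists the tags in source order, including the unknown `mytag` and the `</div>` that arrives before `</p>`. Deciding what tree to build from such a sequence is the job of the tree construction stage.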

4. Parsing scripts

This will be dealt with in the chapter about JavaScript.

5. The order of processing scripts and style sheets

Scripts: The model of the web is synchronous. Authors expect scripts to be parsed and executed immediately when the parser reaches a <script> tag. Parsing of the document halts until the script has been executed. If the script is external, the resource must first be fetched from the network. This is also done synchronously, and parsing halts until the resource is fetched. This was the model for many years and is also specified in the HTML4 and HTML5 specifications. Authors can add the defer attribute (defer="defer") to a <script> tag, in which case it does not halt document parsing and is executed after the document is parsed. HTML5 adds an async attribute to <script>, which marks the script as asynchronous so that it is parsed and executed by a different thread.

Speculative parsing: Both WebKit and Firefox do this optimization. While scripts are executing, another thread parses the rest of the document, finds out what other resources need to be loaded from the network, and loads them. In this way resources can be loaded on parallel connections and overall speed improves. Note that the speculative parser does not modify the DOM tree; that is left to the main parser. It only parses references to external resources such as external scripts, style sheets, and images.

Style sheets: Style sheets, on the other hand, have a different model. Conceptually, since style sheets do not change the DOM tree, there seems to be no reason to wait for them and stop parsing the document. There is an issue, though, with scripts asking for style information during the document parsing stage. If the style is not yet loaded and parsed, the script gets wrong answers, and this causes a series of problems. It looks like an edge case but is quite common. Firefox blocks all scripts while a style sheet is still being loaded and parsed. WebKit blocks scripts only when they try to access certain style properties that may be affected by unloaded style sheets.
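The speculative parsing idea, scanning ahead only for resource references without building any tree, can be sketched with Python's standard `html.parser` module. This illustrates the principle and is not how WebKit or Firefox implement it. The `PreloadScanner` class, the sample markup, and the file names in it are hypothetical, and a real scanner would go on to start network requests for what it finds.

```python
from html.parser import HTMLParser


class PreloadScanner(HTMLParser):
    """Collect external resource references; never builds or touches a DOM."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("script", "img") and attributes.get("src"):
            self.resources.append((tag, attributes["src"]))
        elif (tag == "link" and attributes.get("rel") == "stylesheet"
                and attributes.get("href")):
            self.resources.append(("stylesheet", attributes["href"]))


# The part of the document the main parser has not reached yet.
REMAINING = (
    '<link rel="stylesheet" href="main.css">'
    '<script src="app.js"></script>'
    '<p>text</p><img src="logo.png">'
)

scanner = PreloadScanner()
scanner.feed(REMAINING)
scanner.close()
```

After the scan, `scanner.resources` holds the style sheet, the script, and the image in document order, while the paragraph is skipped entirely. The scanner only gathers URLs so they can be fetched in parallel while the main parser is blocked on a script.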

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
