III. Parsing and the construction of the DOM tree

1. Parsing:
Since parsing is a very significant process within the rendering engine, we will go into it in some depth. Let's begin with a short introduction to parsing.

Parsing a document means converting it into a meaningful structure, something that code can understand and use. The result of parsing is usually a tree of nodes that represents the structure of the document. It is called a parse tree or a syntax tree.

Example: parsing the expression "2 + 3 - 1" returns the tree shown in Figure 3.1.

1) Grammar:
Parsing is based on the syntax rules the document obeys, that is, the language or format it is written in. Every format that can be parsed must have a well-defined grammar consisting of a vocabulary and syntax rules. This is called a context-free grammar. Human languages are not such languages, so they cannot be parsed with conventional parsing techniques.

2) Parser and lexer combination:
Parsing consists of two sub-processes: lexical analysis and syntax analysis. Lexical analysis breaks the input into tokens. Tokens are the vocabulary of the language, the collection of its valid building blocks. Syntax analysis applies the syntax rules of the language.

Parsers usually divide the work between two components: the lexer (sometimes called a tokenizer), which breaks the input into valid tokens, and the parser, which analyzes the document structure according to the syntax rules and constructs the parse tree. The lexer knows how to strip irrelevant characters such as whitespace and line breaks. See Figure 3.1.2.

The parsing process is iterative. The parser usually asks the lexer for a new token and tries to match it against the syntax rules. If a rule matches, a node corresponding to the token is added to the parse tree and the parser asks the lexer for the next token.
If no rule matches, the parser stores the token internally and keeps asking for tokens until a rule matching all the internally stored tokens is found. If no such rule is found, the parser raises an exception. This means the document is invalid and contains syntax errors.

3) Translation:
In many cases the parse tree is not the final product. Parsing is often used in translation, that is, converting the input document to another format. An example is compilation: a compiler that turns source code into machine code first parses it into a parse tree and then translates the tree into machine code, as shown in Figure 3.1.3.

4) Parsing example:
In Figure 3.1 we built a parse tree from a mathematical expression. Let's try to define a simple mathematical language and see how parsing proceeds.

Vocabulary: our language can include integers, plus signs, and minus signs.

Grammar:
1> The building blocks of the syntax are expressions, terms, and operations.
2> Our language can include any number of expressions.
3> An expression is defined as a term followed by an operation, followed by another term.
4> An operation is a plus sign or a minus sign.
5> A term is an integer or an expression.

Now let's analyze the input "2 + 3 - 1":
The first substring that matches a rule is "2"; according to rule 5 it is a term. The second match is "2 + 3", which matches rule 3: a term followed by an operation followed by another term. The next match is found only at the end of the input. "2 + 3 - 1" is an expression because we already know that "2 + 3" is a term, so we again have a term followed by an operation followed by another term (rule 3). "2 + +" does not match any rule and is therefore invalid input.

5) Formal definitions for vocabulary and syntax:
Vocabulary is usually expressed with regular expressions. For example, our language can be defined as:
INTEGER: 0|[1-9][0-9]*
PLUS: +
MINUS: -
As you can see, integers are defined by a regular expression.
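The vocabulary above is enough to sketch a lexer. The following is a minimal illustration in Python, not the lexer of any real engine: the names TOKEN_SPEC and tokenize are hypothetical, and the SKIP rule for whitespace is an assumption added so that spaces are filtered out as described earlier.

```python
import re

# Token definitions copied from the vocabulary above.
# SKIP is an assumed extra rule that discards whitespace.
TOKEN_SPEC = [
    ("INTEGER", r"0|[1-9][0-9]*"),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Split text into (token_name, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            # No token definition matches here, so the input is not in the vocabulary.
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    print(tokenize("2 +3-1 "))
```

The lexer only decides whether each piece of input is a legal token. Whether the tokens appear in a legal order is left to the parser.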
Syntax is usually defined in a format called BNF. Our language is defined as:
expression := term operation term
operation := PLUS | MINUS
term := INTEGER | expression

We said that a language can be parsed by conventional parsers if its grammar is context-free. An intuitive definition of a context-free grammar is a grammar that can be expressed entirely in BNF. For a formal definition, see http://en.wikipedia.org/wiki/Context-free_grammar

6) Types of parsers:
There are two basic types of parsers: top-down parsers and bottom-up parsers. Intuitively, a top-down parser looks at the high-level structure of the syntax and tries to match it, while a bottom-up parser starts from the input and gradually transforms it into syntax rules, beginning with the low-level rules and continuing until the high-level rules are met.

Let's see how the two types of parsers handle our example. The top-down parser starts from the highest-level rule: it identifies "2 + 3" as an expression and then identifies "2 + 3 - 1" as an expression (the process of identifying an expression also matches the other rules, but the starting point is the highest-level rule).

The bottom-up parser scans the input until a rule is matched, then replaces the matching input with the rule. This goes on until the end of the input. Partially matched expressions are placed on the parser's stack. See Figure 3.1.6.

This type of bottom-up parser is called a shift-reduce parser, because the input is shifted to the right (imagine a pointer that starts at the beginning of the input and moves to the right) and is gradually reduced to syntax rules.

7) Generating parsers automatically:
There are tools that can generate a parser for you. They are called parser generators. You feed them the grammar of your language, its vocabulary and syntax rules, and they generate a working parser. Creating a parser requires a deep understanding of parsing, and it is not easy to create an optimized parser by hand, so parser generators can be very useful.
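For comparison with a generated parser, the bottom-up (shift-reduce) process from 6) can be written by hand for our small language. This is only a sketch under some assumptions: the function names are hypothetical, the tiny tokenizer is inlined to keep the example self-contained, and a lone integer is accepted as a term even though the grammar's top-level rule is an expression.

```python
import re


def tokenize(text):
    """Tiny tokenizer for this sketch: integers, '+' and '-'; whitespace is skipped."""
    tokens = []
    for lexeme in re.findall(r"0|[1-9][0-9]*|[+-]|\S", text):
        if lexeme.isdigit():
            tokens.append(("INTEGER", lexeme))
        elif lexeme in "+-":
            tokens.append(("OPERATION", lexeme))
        else:
            raise SyntaxError(f"illegal character {lexeme!r}")
    return tokens


def reduce_stack(stack):
    """Apply grammar rules to the top of the stack until none matches."""
    while True:
        kinds = [item[0] for item in stack[-3:]]
        if stack and stack[-1][0] == "INTEGER":
            # term := INTEGER
            stack[-1] = ("term", stack[-1][1])
        elif kinds == ["term", "OPERATION", "term"]:
            # expression := term operation term, and an expression is itself a term
            right = stack.pop()
            operation = stack.pop()
            left = stack.pop()
            stack.append(("term", (operation[1], left[1], right[1])))
        else:
            return


def parse(text):
    """Return a nested tuple (operation, left, right) as the parse tree."""
    stack = []
    for token in tokenize(text):
        stack.append(token)  # shift: move the next token onto the stack
        reduce_stack(stack)  # reduce: replace matched input with the rule
    if len(stack) != 1 or stack[0][0] != "term":
        raise SyntaxError("input does not match any rule")
    return stack[0][1]


if __name__ == "__main__":
    print(parse("2 + 3 - 1"))
```

With the input "2 + +", the stack ends up holding a term followed by two operations, no rule applies, and the sketch raises SyntaxError, which mirrors the invalid-input case described in the parsing example.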
WebKit uses two well-known parser generators: Flex for creating the lexer and Bison for creating the parser (you may also run into them under the names Lex and Yacc). Flex's input is a file containing regular expression definitions of the tokens. Bison's input is the syntax rules of the language in BNF format.

2. HTML parser:
The job of the HTML parser is to parse the HTML markup into a parse tree.

1) HTML grammar definition
The vocabulary and syntax of HTML are defined in specifications created by the W3C. The current version is HTML4, and work on HTML5 is in progress.

2) Not a context-free grammar
As we saw in the introduction to parsing, grammar syntax can be defined formally using formats like BNF. Unfortunately, none of the conventional parser topics apply to HTML (I did not bring them up just for fun; they are used for parsing CSS and JavaScript). HTML cannot be defined by the context-free grammar that conventional parsers require.

In the past the HTML format was specified with a DTD (Document Type Definition), but a DTD is not a context-free grammar.

HTML is rather close to XML, and there are many parsers available for XML. There is also an XML variation of HTML called XHTML, so what is the big difference? The difference is that HTML is more "forgiving": it lets you omit certain start or end tags. On the whole it is a "soft" syntax, as opposed to XML's strict and demanding syntax.

This seemingly small difference makes a world of difference. On the one hand, it is the main reason HTML is so popular: it forgives your mistakes and makes life easier for web authors. On the other hand, it makes it difficult to write a formal grammar. To summarize, HTML is not easy to parse: conventional parsers cannot handle it, because its grammar is not context-free, and neither can XML parsers.

3) HTML DTD
The HTML definition is given in DTD format.
This format is used to define languages of the SGML family. It contains definitions of all allowed elements, their attributes, and their hierarchy. As we said earlier, the HTML DTD does not form a context-free grammar.

There are several variations of the DTD. Strict mode conforms solely to the specification, while the other modes also support markup used by earlier browsers, for the sake of backward compatibility with older content. The current strict DTD is at http://www.w3.org/TR/html4/strict.dtd

4) DOM
The tree that the parser outputs is made up of DOM element and attribute nodes. DOM is short for Document Object Model. It is the object representation of the HTML document and the interface between HTML elements and the outside world (such as JavaScript).

The DOM has an almost one-to-one correspondence with the markup. For example:
<html>
  <body>
    <p>hello World</p>
    <div><img src="aa.png"/></div>
  </body>
</html>
When we say that the tree contains DOM nodes, it means that the tree is made up of elements that implement the DOM interface. These implementations contain some other properties that are required internally by the browser.
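The near one-to-one mapping from markup to tree nodes can be illustrated with a small sketch built on Python's standard html.parser module. This is not how a browser engine builds its DOM: Node, TreeBuilder, build_tree, and dump are hypothetical names, the Node class is a bare stand-in for the real DOM interfaces, and the end-tag handling is deliberately naive (it has none of the error recovery a real HTML parser needs).

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical stand-in for a DOM node; real DOM interfaces are far richer."""

    def __init__(self, name, attrs=(), parent=None, text=None):
        self.name = name
        self.attrs = dict(attrs)
        self.parent = parent
        self.text = text
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Turns the parser's tag events into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Opening tag: create a child node and descend into it.
        self.current = Node(tag, attrs, self.current)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <img .../>: create the node, do not descend.
        Node(tag, attrs, self.current)

    def handle_endtag(self, tag):
        # Naive assumption: every end tag closes the current node.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            Node("#text", parent=self.current, text=data.strip())


def build_tree(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Return the tree as indented lines, one per node."""
    label = node.name if node.text is None else f"{node.name} {node.text!r}"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


if __name__ == "__main__":
    markup = '<html><body><p>hello World</p><div><img src="aa.png"/></div></body></html>'
    print("\n".join(dump(build_tree(markup))))
```

Each element in the markup becomes one node, nested exactly as the tags are nested, and the text inside the paragraph becomes a text node under it.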