Rendering engine, HTML parsing

Source: Internet
Author: User

This is a translation of "How Browsers Work". Reposted from: Ctrip Design Committee

Rendering engine

The job of the rendering engine is, well, rendering: displaying the requested content on the browser screen.

By default, the rendering engine can display HTML and XML documents as well as images. It can display other types of documents through plugins (browser extensions); for example, a PDF viewer plugin displays PDF files. We will discuss plugins and extensions in a dedicated section. In this section we focus on the main use case of the rendering engine: displaying HTML and images formatted with CSS.

Rendering engines

The browsers we discuss (Firefox, Chrome, and Safari) are built on two rendering engines: Firefox uses Gecko, Mozilla's own rendering engine, while Safari and Chrome both use WebKit.

WebKit is an open-source rendering engine that began as an engine for the Linux platform and was modified by Apple to support Mac and Windows. For more information, see http://webkit.org/.

The main flow

The rendering engine starts by fetching the content of the requested document from the network layer, usually in chunks of no more than 8K. After that, the rendering engine follows this basic flow:

Figure 2: Basic flow of the rendering engine (parsing HTML to build the DOM tree, render tree construction, render tree layout, painting the render tree).

The rendering engine parses the HTML document and converts the tags into DOM nodes in a tree called the content tree. It also parses the style data, both from external files and from style elements. The style information, together with the visual instructions in the HTML, is used to create another tree: the render tree.

The render tree contains rectangles with visual attributes such as color and dimensions. The rectangles are in the order in which they will be displayed on the screen.

Once the render tree has been built, it goes through a "layout" process, which gives each node the exact coordinates where it should appear on the screen. The next stage is painting: the render tree is traversed and each node is painted using the UI backend layer.

It is important to understand that this is a gradual process. For a better user experience, the rendering engine tries to display content on the screen as soon as possible. It does not wait until all of the HTML has been parsed before it starts building and laying out the render tree. Parts of the content are parsed and displayed while the engine continues to process the remaining content.

Main flow examples

Figure 3: WebKit main flow
Figure 4: Mozilla's Gecko rendering engine main flow (3.6)

As Figures 3 and 4 show, WebKit and Gecko use slightly different terminology, but the flow is essentially the same.

Gecko calls the tree of visually formatted elements a "frame tree"; each element is a frame. WebKit uses the term "render tree", which is made up of "render objects". WebKit uses the term "layout" for the placement of elements, while Gecko calls it "reflow". "Attachment" is WebKit's term for connecting DOM nodes with visual information to build the render tree. A minor, non-semantic difference is that Gecko has an extra layer between the HTML and the DOM tree, called the "content sink", which is a factory for creating DOM elements. We will discuss each part of the flow.

Parsing

Because parsing is a very significant process within the rendering engine, we will go into it a little more deeply. Let's begin with a short introduction to parsing.

Parsing a document means translating it into a structure that code can use. The result of parsing is usually a tree of nodes that represents the structure of the document, called a parse tree or a syntax tree.

Example: parsing the expression "2 + 3 - 1" could return the following tree:


Figure 5: Mathematical expression tree node
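
As a rough illustration, a parse tree like this can be held in an ordinary data structure. The Python sketch below assumes a tuple encoding (operator, left, right) with integers as leaves, and assumes the expression groups left to right. The encoding and the evaluate helper are invented for this example and are not taken from any browser code.

```python
# Hypothetical encoding of a parse tree for "2 + 3 - 1":
# inner nodes are (operator, left, right) tuples, leaves are plain integers.
tree = ("-", ("+", 2, 3), 1)


def evaluate(node):
    """Walk the tree recursively and compute the value it represents."""
    if isinstance(node, int):
        return node
    operator, left, right = node
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if operator == "+" else lhs - rhs


print(evaluate(tree))  # 4
```

Walking the tree from the leaves upward gives 2 + 3 = 5 and then 5 - 1 = 4, which shows why a tree is a convenient form for code to work with.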

Parsing is based on the syntax rules that the document obeys, that is, the language or format it was written in. Every format that can be parsed must have a deterministic grammar consisting of a vocabulary and syntax rules. Such a grammar is called a context-free grammar. Human languages are not such languages and therefore cannot be parsed with conventional parsing techniques.

Parser-lexer combination

Parsing can be separated into two subprocesses: lexical analysis and syntax analysis.

Lexical analysis is the process of breaking the input into tokens. Tokens are the vocabulary of the language: the collection of its valid building blocks.

Syntax analysis is the application of the language's syntax rules.

Parsers usually divide the work between two components: the lexer (sometimes called the tokenizer), which breaks the input into valid tokens, and the parser, which constructs the parse tree by analyzing the document structure according to the language's syntax rules. The lexer knows how to strip irrelevant characters such as whitespace and line breaks.


Figure 6: From the source document to the parse tree (document, lexical analysis, parsing, parse tree).

The parsing process is iterative. The parser usually asks the lexer for a new token and tries to match the token with one of the syntax rules. If a rule is matched, a node corresponding to the token is added to the parse tree and the parser asks for another token. If no rule matches, the parser stores the token internally and keeps asking for tokens until a rule matching all the internally stored tokens is found. If no rule is found, the parser raises an exception. This means the document was not valid and contained syntax errors.
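
The division of labor described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the token kinds INTEGER and OPERATOR, the lex generator, and the check function are names made up for this example, and the "parser" only checks that integers and operators alternate instead of building a tree.

```python
def lex(text):
    """Lexer: yield (kind, value) tokens, skipping irrelevant whitespace."""
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            yield ("INTEGER", text[i:j])
            i = j
        elif ch in "+-":
            yield ("OPERATOR", ch)
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r}")


def check(text):
    """Parser loop: pull one token at a time and match INTEGER (OPERATOR INTEGER)*."""
    expected = "INTEGER"
    count = 0
    for kind, value in lex(text):
        if kind != expected:
            raise SyntaxError(f"expected {expected}, got {value!r}")
        expected = "OPERATOR" if kind == "INTEGER" else "INTEGER"
        count += 1
    if expected == "INTEGER":
        raise SyntaxError("input ended where an integer was expected")
    return count


print(check("2 + 3 - 1"))  # 5
```

Because lex is a generator, the parser really does request tokens one at a time, and an input such as "2 + +" fails with a SyntaxError as soon as the second operator arrives.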

Translation

In many cases the parse tree is not the final product. Parsing is often used in translation: transforming the input document into another format. An example is compilation. A compiler that compiles source code into machine code first parses it into a parse tree and then translates the tree into machine code.


Figure 7: Compilation flow (source code, parsing, parse tree, translation, machine code).

Parsing example

In Figure 5 we built a parse tree from a mathematical expression. Let's try to define a simple mathematical language and see how the parsing is done.

Vocabulary: our language can include integers, plus signs, and minus signs.

Syntax:

    1. The syntax building blocks of the language are expressions, terms, and operations.
    2. Our language can include any number of expressions.
    3. An expression is defined as a term followed by an operation followed by another term.
    4. An operation is a plus sign or a minus sign.
    5. A term is an integer or an expression.

Let's analyze the input "2 + 3 - 1".

The first substring that matches a rule is "2": according to rule #5 it is a term. The second match is "2 + 3": it matches the third rule, a term followed by an operation followed by another term. The next match is only found at the end of the input. "2 + 3 - 1" is an expression because we already know that "2 + 3" is a term, so we again have a term followed by an operation followed by another term. "2 + +" does not match any rule and is therefore invalid input.

Formal definitions for vocabulary and syntax

Vocabulary is usually expressed with regular expressions.

For example, our language can be defined as:

    INTEGER: 0|[1-9][0-9]*
    PLUS: +
    MINUS: -
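
To see these definitions at work, here is a small sketch using Python's standard re module. The three patterns are copied from the definitions above; the tokenize function, the combined TOKEN_RE pattern, and the choice to skip whitespace before each token are assumptions made for this example.

```python
import re

# One named group per token type; the patterns come from the vocabulary above.
TOKEN_RE = re.compile(r"\s*(?:(?P<INTEGER>0|[1-9][0-9]*)|(?P<PLUS>\+)|(?P<MINUS>-))")


def tokenize(text):
    """Return a list of (token type, matched text) pairs."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"no token matches at position {pos}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


print(tokenize("2 + 30 - 1"))
```

This is roughly the kind of input a lexer generator works from: a list of token names, each paired with a regular expression.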

As you can see, integers are defined by a regular expression.

Syntax is usually defined in a format called BNF. Our language is defined as:

    expression := term operation term
    operation := PLUS | MINUS
    term := INTEGER | expression

We said that a language can be parsed by regular parsers if its grammar is a context-free grammar. An intuitive definition of a context-free grammar is a grammar that can be expressed entirely in BNF. For a formal definition, see http://en.wikipedia.org/wiki/Context-free_grammar

Types of parsers

There are two basic types of parsers: top-down parsers and bottom-up parsers. Intuitively, a top-down parser examines the high-level structure of the syntax and tries to find a rule that matches. A bottom-up parser starts with the input and gradually transforms it into syntax rules, starting from the low-level rules until the high-level rules are met.

Let's see how the two types of parsers parse our example:

The top-down parser starts from the highest-level rule: it identifies "2 + 3" as an expression and then identifies "2 + 3 - 1" as an expression (the process of identifying the expression also matches the other rules, but the starting point is the highest-level rule).
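
A hand-written top-down parser for this language might look like the Python sketch below. It is only an illustration under stated assumptions: the rule "term := INTEGER | expression" starts with itself when read literally, so the sketch handles the repetition with a loop in which the expression built so far becomes the left-hand term. The function names and the tuple encoding of the tree are invented for this example, and tokens_of silently skips characters it does not recognize.

```python
import re


def tokens_of(text):
    """Simplified lexer: integers and operators, everything else is skipped."""
    return re.findall(r"\d+|[+-]", text)


def parse_expression(tokens):
    """Start from the top rule: expression := term operation term."""
    pos = 0

    def parse_integer():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"integer expected at token {pos}")
        pos += 1
        return int(tokens[pos - 1])

    node = parse_integer()                    # term := INTEGER
    if pos == len(tokens):
        raise SyntaxError("an expression needs at least one operation")
    while pos < len(tokens):
        operation = tokens[pos]
        if operation not in ("+", "-"):
            raise SyntaxError(f"operation expected at token {pos}")
        pos += 1
        # The expression so far acts as the left term of a larger expression.
        node = (operation, node, parse_integer())
    return node


print(parse_expression(tokens_of("2 + 3 - 1")))  # ('-', ('+', 2, 3), 1)
```

The result mirrors the description above: "2 + 3" is recognized as an expression first, and then becomes a term inside the expression "2 + 3 - 1".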

The bottom-up parser scans the input until a rule is matched and then replaces the matching input with the rule. This goes on until the end of the input. Partly matched expressions are placed on the parser's stack.

    Stack                   Input
                            2 + 3 - 1
    term                    + 3 - 1
    term operation          3 - 1
    expression              - 1
    expression operation    1
    expression

This type of bottom-up parser is called a shift-reduce parser, because the input is shifted to the right (imagine a pointer that starts at the beginning of the input and moves to the right) and is gradually reduced to syntax rules.
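
The stack and input columns can be reproduced with a short simulation. The Python sketch below is a hand-rolled illustration, not the output of a parser generator: shift_reduce is a hypothetical helper, the input is assumed to be already split into tokens, and an expression on the stack is allowed to act as a term, as the grammar permits.

```python
def shift_reduce(tokens):
    """Return the (stack, remaining input) pairs seen while parsing tokens."""
    stack, rest = [], list(tokens)
    trace = [(" ".join(stack), " ".join(rest))]
    while rest:
        token = rest.pop(0)                   # shift one token off the input
        if token.isdigit():
            stack.append("term")
        elif token in ("+", "-"):
            stack.append("operation")
        else:
            raise SyntaxError(f"unknown token {token!r}")
        # Reduce when the top of the stack matches "term operation term".
        if stack[-3:] in (["term", "operation", "term"],
                          ["expression", "operation", "term"]):
            stack[-3:] = ["expression"]
        trace.append((" ".join(stack), " ".join(rest)))
    if stack != ["expression"]:
        raise SyntaxError("input is not a complete expression")
    return trace


for stack, remaining in shift_reduce("2 + 3 - 1".split()):
    print(f"{stack:<22}{remaining}")
```

Each printed row corresponds to one shift, followed by a reduce whenever the top of the stack matches a rule.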

Generating parsers automatically

There are tools that can generate a parser for you; they are called parser generators. You feed them the grammar of your language (its vocabulary and syntax rules) and they generate a working parser. Creating a parser requires a deep understanding of parsing, and it is not easy to write an optimized parser by hand, so parser generators can be very useful.

WebKit uses two well-known parser generators: Flex for creating a lexer and Bison for creating a parser (you might run into them under the names Lex and Yacc). Flex's input is a file containing regular expression definitions of the tokens. Bison's input is the language's syntax rules in BNF format.

HTML parser

The job of the HTML parser is to parse the HTML markup into a parse tree.

HTML grammar definition

The vocabulary and syntax of HTML are defined in specifications created by the W3C organization. The current version is HTML4, and work on HTML5 is in progress.

Not a context-free grammar

As we saw in the introduction to parsing, grammar syntax can be defined formally using formats like BNF. Unfortunately, none of the conventional parser topics apply to HTML (I did not bring them up just for fun: they are used when parsing CSS and JavaScript). HTML cannot easily be defined by the context-free grammar that parsers need. There is a formal format for defining HTML, the DTD (Document Type Definition), but it is not a context-free grammar.

HTML is rather close to XML, and there are many XML parsers available. There is even an XML variation of HTML called XHTML, so what is the big difference? The difference is that the HTML approach is more "forgiving": it lets you omit certain start or end tags. On the whole it is a "soft" syntax, as opposed to XML's stiff and demanding syntax. This seemingly small detail makes a world of difference. On the one hand, it is the main reason HTML is so popular: it forgives your mistakes and makes life easy for the web author. On the other hand, it makes it difficult to write a formal grammar. To summarize, HTML cannot be parsed easily by conventional parsers, since its grammar is not context-free, and it cannot be parsed by XML parsers either.

HTML DTD

HTML is defined in a DTD format. This format is used to define languages of the SGML family. It contains definitions for all allowed elements, their attributes, and their hierarchy. As we saw earlier, the HTML DTD does not form a context-free grammar.

There are a few variations of the DTD. The strict mode conforms solely to the specifications, but other modes contain support for markup used by earlier browsers, for backward compatibility. The current strict DTD is here: http://www.w3.org/TR/html4/strict.dtd

DOM

The output tree (the "parse tree") is a tree of DOM element and attribute nodes. DOM is short for Document Object Model. It is the object representation of the HTML document and the interface between HTML elements and the outside world, such as JavaScript.

The DOM has an almost one-to-one relationship with the markup. For example, the following markup

would be converted to a DOM tree like this:


Figure 8: DOM tree of the example markup

Like HTML, the DOM is specified by the W3C organization. See http://www.w3.org/DOM/DOMTR. It is a generic specification for manipulating documents. A specific module describes HTML-specific elements: http://www.w3.org/TR/2003/REC-DOM-Level-2-HTML-20030109/idl-definitions.html.

When we say that the tree contains DOM nodes, we mean that the tree is made up of elements that implement one of the DOM interfaces. Browsers use concrete implementations that have other attributes used by the browser internally.

The parsing algorithm

As we saw in the previous sections, HTML cannot be parsed using regular top-down or bottom-up parsers.

The reasons are:

    1. The forgiving nature of the language.
    2. The fact that browsers need to tolerate errors in invalid HTML.
    3. The parsing process is reentrant. Normally the source does not change during parsing, but in HTML, script tags containing "document.write" can add extra content, so the parsing process actually modifies the input.

Since regular parsing techniques cannot be used, browsers create custom parsers for parsing HTML.

The parsing algorithm is described in detail in the HTML5 specification. It consists of two stages: tokenization and tree construction.

Tokenization is the lexical analysis: it parses the input into tokens. Among the HTML tokens are start tags, end tags, attribute names, and attribute values.

The tokenizer recognizes a token, hands it to the tree constructor, and moves on to recognize the next token, and so on until the end of the input.


Figure 6: HTML parsing flow (taken from the HTML5 specification)

Tokenization algorithm

The output of the algorithm is an HTML token. The algorithm is expressed as a state machine. Each state consumes one or more characters of the input stream and updates the next state according to those characters. The decision is influenced by the current tokenization state and by the tree construction state, which means that the same consumed character can yield different results depending on the current state. The algorithm is too complex to describe fully, so let's look at a simple example to understand the principle.

Basic example: tokenizing the following HTML:

    <html><body>Hello World</body></html>

The initial state is the "Data state". When the "<" character is encountered, the state changes to "Tag open state". Consuming an "a-z" character causes the creation of a "Start tag token", and the state changes to "Tag name state". We stay in this state until the ">" character is consumed. Each character is appended to the new token's name. In our case the created token is an "html" token.

When the ">" character is reached, the current token is emitted and the state changes back to the "Data state". The "<body>" tag is treated by the same steps. So far the "html" and "body" tags have been emitted, and we are back in the "Data state". Consuming the "H" character of "Hello World" starts producing character tokens, and this goes on until the "<" of "</body>" is reached, by which point all the characters of "Hello World" have been emitted.

Now we are back at the "Tag open state". Consuming the next input "/" causes the creation of an "end tag token" and a move to the "Tag name state". Again we stay in this state until we reach ">". Then the new tag token is emitted and we go back to the "Data state". The "</html>" input is treated in the same way.


Figure 9: Tokenizing the example input
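
The walk-through above can be condensed into a toy state machine. The Python sketch below is a heavy simplification made for illustration and is not the HTML5 algorithm: it knows only three states, ignores attributes, comments, and entities, groups consecutive characters into one "characters" token instead of emitting them one by one, and treats a "<" that does not start a tag as plain text. The function name and token labels are invented for this example.

```python
def tokenize_html(source):
    """Toy tokenizer: plain start tags, end tags, and text only."""
    tokens = []
    state, text, name, kind = "data", "", "", "start"
    for ch in source:
        if state == "data":
            if ch == "<":
                if text:                      # emit pending character data
                    tokens.append(("characters", text))
                    text = ""
                state = "tag open"
            else:
                text += ch
        elif state == "tag open":
            if ch == "/":
                kind = "end"                  # an end tag token is being created
            elif ch.isalpha():
                name, state = ch.lower(), "tag name"
            else:                             # not a tag after all: keep as text
                text += "<" + ch
                kind, state = "start", "data"
        elif state == "tag name":
            if ch == ">":
                tokens.append((kind + " tag", name))
                kind, state = "start", "data"
            else:
                name += ch.lower()
    if text:
        tokens.append(("characters", text))
    return tokens


for token in tokenize_html("<html><body>Hello World</body></html>"):
    print(token)
```

The printed sequence is a start tag "html", a start tag "body", the text "Hello World", an end tag "body", and an end tag "html", matching the order of events in the description.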

Tree construction algorithm

When the parser is created, the Document object is created too. During the tree construction stage, the DOM tree with the Document at its root is modified and elements are added to it. Each node emitted by the tokenizer is processed by the tree constructor. For each token, the specification defines which DOM element is relevant to it. Besides being added to the DOM tree, the element is also added to a stack of open elements. This stack is used to correct nesting mismatches and unclosed tags. The algorithm is also described as a state machine, and its states are called "insertion modes".

Let's look at the tree construction process for the example input:

    <html><body>Hello World</body></html>

The input to the tree construction stage is the sequence of tokens from the tokenization stage. The first mode is the "initial mode". Receiving the "html" token causes a move to the "before html" mode and a reprocessing of the token in that mode. This creates the HTMLHtmlElement element, which is appended to the root Document object.

The state then changes to "before head". When the "body" token is received, an HTMLHeadElement is created implicitly, although we do not have a "head" token, and it is added to the tree.

We now move to the "in head" mode and then to "after head". The body token is reprocessed, an HTMLBodyElement is created and inserted, and the mode changes to "in body".

When the character tokens of "Hello World" are received, a "Text" node is created and the characters are appended to it one by one.

Receiving the body end tag causes a move to the "after body" mode. Receiving the html end tag then moves us to the "after after body" mode. Parsing ends when all tokens have been processed.
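
The role of the open elements stack and of the implied "head" can be shown with a small sketch. This Python example is a loose illustration only: it does not model insertion modes, the dictionaries stand in for DOM objects, and the token format and the build_tree and outline helpers are assumptions made for this example.

```python
def build_tree(tokens):
    """Build a nested dict tree from (kind, value) tokens."""
    document = {"name": "#document", "children": []}
    open_elements = [document]                # the stack of open elements

    def insert(name):
        node = {"name": name, "children": []}
        open_elements[-1]["children"].append(node)
        return node

    head_seen = False
    for kind, value in tokens:
        if kind == "start tag":
            if value == "body" and not head_seen:
                insert("head")                # implied element, never left open
                head_seen = True
            if value == "head":
                head_seen = True
            open_elements.append(insert(value))
        elif kind == "characters":
            insert("#text")["data"] = value
        elif kind == "end tag":
            # Close everything up to the matching element; ignore stray end tags.
            if value in [node["name"] for node in open_elements]:
                while open_elements.pop()["name"] != value:
                    pass
    return document


def outline(node):
    """Compact view of the tree: (name, [children...])."""
    return (node["name"], [outline(child) for child in node["children"]])


tokens = [("start tag", "html"), ("start tag", "body"),
          ("characters", "Hello World"),
          ("end tag", "body"), ("end tag", "html")]
print(outline(build_tree(tokens)))
```

Even though the input never mentions a head element, the resulting tree has "head" and "body" as children of "html", with the text node inside "body".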


Figure 10: Tree construction of the example HTML

Actions when parsing is finished

At this stage the browser marks the document as interactive and starts parsing scripts that are in "deferred" mode, that is, scripts that should be executed after the document has been parsed. Once the scripts have been processed, the document state is set to "complete" and a "load" event is fired.

The complete algorithm is in the HTML5 specification: http://www.w3.org/TR/html5/syntax.html#html-parser

Browsers' error tolerance

You never get an "Invalid Syntax" error on an HTML page. Browsers fix any invalid content and go on. Take a look at the following example:

I must have violated about a million rules ("mytag" is not a standard tag, the "p" and "div" elements are wrongly nested, and more), but the browser still shows it correctly and does not complain. So a lot of the parser code is there to fix the HTML author's mistakes.
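
The same forgiving behavior can be observed outside a browser. The Python sketch below feeds a hypothetical malformed input, written here to contain the kinds of errors just mentioned, to the standard library's html.parser.HTMLParser. Note the assumptions: HTMLParser is an event-based parser that does not repair the tree the way a browser does, and the EventLogger class is invented for this example. The point is only that no syntax error is raised.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every event the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
# Hypothetical input: an unknown tag plus wrongly nested "div" and "p" elements.
logger.feed("<mytag></mytag><div><p></div>Really lousy HTML</p>")
logger.close()

for event in logger.events:
    print(event)
```

The parser reports the unknown tag and the mismatched end tags as ordinary events and leaves it to the consumer to decide how to recover, which is where the browser's error-fixing code comes in.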

Error handling is quite consistent across browsers, but amazingly enough it has not been part of the current HTML specification. Like bookmarking and the back and forward buttons, it is just something that developed in browsers over the years. Some invalid HTML constructs appear on many sites, and browsers try to fix them in a way that is consistent with other browsers.

The HTML5 specification does define some of these requirements. WebKit summarizes this nicely in the comment at the beginning of its HTML parser class:

The parser parses tokenized input into the document, building up the document tree. If the document is well-formed, parsing it is straightforward.
Unfortunately, we have to handle many HTML documents that are not well-formed, so the parser has to be tolerant of errors.
We have to take care of at least the following error conditions:
1. The element must be inserted in the correct position. Unclosed tags should be closed one by one until the new element can be added.
2. We are not allowed to add the element directly. The author may have left out some tags in between, such as: HTML HEAD BODY TBODY TR TD LI (did I forget any?).
3. We want to add a block element inside an inline element. Close all inline elements and then add the block element.
4. If this does not help, close elements until we are allowed to add the element, or ignore the tag.

Let's look at some examples of WebKit error tolerance:

</br> instead of <br>

Some sites use </br> instead of <br>. To be compatible with IE and Firefox, WebKit treats it like <br>. The code:

    if (t->isCloseTag(brTag) && m_document->inCompatMode()) {
        reportError(MalformedBRError);
        t->beginTag = true;
    }

Note that the error handling here is internal: it is not presented to the user.

A stray table

A stray table is a table inside another table, but not inside a cell of the outer table. For example:

    <table>
        <table>
            <tr><td>inner table</td></tr>
        </table>
        <tr><td>outer table</td></tr>
    </table>

WebKit changes the hierarchy into two sibling tables:

    <table>
        <tr><td>outer table</td></tr>
    </table>
    <table>
        <tr><td>inner table</td></tr>
    </table>

The code:

    if (m_inStrayTableContent && localName == tableTag)
        popBlock(tableTag);

WebKit uses a stack for the current element contents: it pops the inner table out of the outer table's stack, so the two tables become siblings.

Nested form elements

If a form is placed inside another form, the second form is ignored. The code:

    if (!m_currentFormElement) {
        m_currentFormElement = new HTMLFormElement(formTag, m_document);
    }

A too-deep tag hierarchy

The comment speaks for itself:

www.liceo.edu.mx is an example of a site that nests too deeply: it reaches a depth of about 1500 tags, all from a bunch of <b> tags. We only allow at most 20 nested tags of the same type before ignoring them altogether.

    bool HTMLParser::allowNestedRedundantTag(const AtomicString& tagName)
    {
        unsigned i = 0;
        for (HTMLStackElem* curr = m_blockStack;
             i < cMaxRedundantTagDepth && curr && curr->tagName == tagName;
             curr = curr->next, i++) { }
        return i != cMaxRedundantTagDepth;
    }
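
For readers who do not read C++, the same check can be restated as a short Python sketch. This is a paraphrase written for illustration, not WebKit code: the stack of open blocks is assumed to be a plain list of tag names with the innermost element first, and the limit of 20 is taken from the comment above.

```python
MAX_REDUNDANT_TAG_DEPTH = 20   # value taken from the comment; name is illustrative


def allow_nested_redundant_tag(block_stack, tag_name):
    """Return False once tag_name already sits on top of the stack 20 times in a row."""
    depth = 0
    for open_tag in block_stack:
        if depth >= MAX_REDUNDANT_TAG_DEPTH or open_tag != tag_name:
            break
        depth += 1
    return depth != MAX_REDUNDANT_TAG_DEPTH


print(allow_nested_redundant_tag(["b"] * 19, "b"))  # True
print(allow_nested_redundant_tag(["b"] * 20, "b"))  # False
```

Only an unbroken run of the same tag at the top of the stack counts, so a different element in between resets the count.
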

Misplaced html or body end tag

Again, the comment speaks for itself:

Support for really broken HTML. We never close the body tag, since some stupid web pages close it before the actual end of the document. Let's rely on the end() call to close things.

    if (t->tagName == htmlTag || t->tagName == bodyTag)
        return;

So web authors beware: unless you want to appear as an example in a WebKit error tolerance code snippet, write well-formed HTML.
