WebKit: DOM Transcoding and Parsing

Last Update:2014-07-01 Source: Internet

Author: User

The real data processing is done by DocumentParser::appendBytes and the subsequent call to DocumentParser::finish, so this article focuses on those two calls.

Data reception and decoding

TextResourceDecoder

TextResourceDecoder::decode()

An important step in this function is appending the received data to TextResourceDecoder::m_buffer.

The function first calls TextResourceDecoder::checkForHeadCharset, which checks whether the HTML head carries encoding information. If an HTML page specifies its encoding, that information is normally placed in a <meta> tag inside <head>.

Each newly received chunk of data is appended to TextResourceDecoder::m_buffer so that the TextResourceDecoder can process it.

An HTMLMetaCharsetParser is then created and assigned to TextResourceDecoder::m_charsetParser. Its HTMLMetaCharsetParser::checkForMetaCharset method tries to detect the encoding and, if one is found, the resulting TextEncoding is set on the TextResourceDecoder.

TextResourceDecoder has the members TextEncoding m_encoding and EncodingSource m_source, which record the encoding itself and where it came from.

TextResourceDecoder also has a member OwnPtr<TextCodec> m_codec, which performs the actual decoding: the data in TextResourceDecoder::m_buffer is passed to TextCodec::decode, which returns the decoded data as a String.
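The buffer-then-sniff-then-decode flow described above can be sketched roughly as follows. This is a simplified illustration and not WebKit's code: the class SketchDecoder, its helper functions, the ISO-8859-1 default, and the Latin-1 fallback conversion are all assumptions made for the example. The real TextResourceDecoder delegates detection to HTMLMetaCharsetParser and decoding to a TextCodec, and also copes with multi-byte sequences split across chunks, which this sketch does not.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for TextResourceDecoder; all names are illustrative.
class SketchDecoder {
public:
    // Append raw bytes to the buffer, sniff the charset if still unknown,
    // then decode and return the buffered data as UTF-8.
    std::string decode(const std::string& bytes) {
        m_buffer += bytes;
        if (!m_checkedForCharset)
            sniffMetaCharset();
        std::string decoded = decodeBuffer();
        m_buffer.clear();
        return decoded;
    }
    const std::string& encoding() const { return m_encoding; }

private:
    // Rough analogue of a meta charset check: search the buffer for "charset=".
    void sniffMetaCharset() {
        assert(!m_checkedForCharset);
        std::string lower;
        for (unsigned char c : m_buffer)
            lower += static_cast<char>(std::tolower(c));
        std::size_t pos = lower.find("charset=");
        if (pos == std::string::npos)
            return;  // keep the default and try again on the next chunk
        pos += 8;
        if (pos < lower.size() && (lower[pos] == '"' || lower[pos] == '\''))
            ++pos;   // tolerate charset="utf-8"
        std::size_t end = pos;
        while (end < lower.size() &&
               (std::isalnum(static_cast<unsigned char>(lower[end])) || lower[end] == '-'))
            ++end;
        if (end == pos)
            return;
        m_encoding = lower.substr(pos, end - pos);
        m_checkedForCharset = true;
    }

    // Stand-in for TextCodec::decode: UTF-8 passes through unchanged,
    // anything else is treated as Latin-1 and converted to UTF-8.
    std::string decodeBuffer() const {
        if (m_encoding == "utf-8")
            return m_buffer;
        std::string out;
        for (unsigned char c : m_buffer) {
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }

    std::string m_buffer;
    std::string m_encoding = "iso-8859-1";  // assumed default for the sketch
    bool m_checkedForCharset = false;
};
```

Feeding the sample <meta> tag through decode() should leave encoding() at "utf-8", while a decoder that never sees a charset declaration falls back to the assumed Latin-1 conversion.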

DecodedDataDocumentParser

DecodedDataDocumentParser::appendBytes()

This function receives a DocumentWriter and a char* buffer, decodes the network data through writer->createDecoderIfNeeded()->decode, and finally calls append.

HTMLDocumentParser

HTMLDocumentParser::append

In DecodedDataDocumentParser::appendBytes, append is called once TextResourceDecoder::decode has produced data. append is implemented in HTMLDocumentParser, so the call lands in HTMLDocumentParser::append, with the decoded string as its parameter. HTMLDocumentParser has a member HTMLInputStream m_input, and the string passed in is appended to HTMLDocumentParser::m_input. In this way the decoded string ends up stored in the HTMLDocumentParser.
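The hand-off from the byte-level base class to the HTML parser can be sketched like this. It is a minimal illustration under stated assumptions: SketchDecodedDataParser and SketchHTMLParser are hypothetical classes, a std::string stands in for HTMLInputStream, and the decoding step is reduced to a byte copy where the real code would call the decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical base class modelled loosely on DecodedDataDocumentParser.
class SketchDecodedDataParser {
public:
    virtual ~SketchDecodedDataParser() = default;

    // Decode the raw bytes, then hand the text to the subclass through append().
    void appendBytes(const char* data, std::size_t length) {
        assert(data || length == 0);
        // Placeholder for the decoder call: here the bytes are copied as-is.
        std::string decoded = length ? std::string(data, length) : std::string();
        if (!decoded.empty())
            append(decoded);
    }

protected:
    virtual void append(const std::string& text) = 0;
};

// Hypothetical subclass modelled loosely on HTMLDocumentParser.
class SketchHTMLParser : public SketchDecodedDataParser {
public:
    const std::string& input() const { return m_input; }

protected:
    // Store the decoded text; the tokenizer would later read from here.
    void append(const std::string& text) override { m_input += text; }

private:
    std::string m_input;  // plays the role of HTMLInputStream m_input
};
```

Because append is virtual, the base class never needs to know which concrete parser receives the decoded text.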

DocumentWriter

DocumentWriter::createDecoderIfNeeded()

If m_decoder is empty, a new instance is created; otherwise the existing one is returned directly.

The MIME type, the TextEncoding, and similar information are passed in at creation. The new TextResourceDecoder is assigned to DocumentWriter::m_decoder, and some encoding-related setup follows.

The new TextResourceDecoder is then set on the Document (which is found through the Frame). Document also has a member RefPtr<TextResourceDecoder> m_decoder, and the new decoder is assigned to it. Finally, the TextResourceDecoder is returned.
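The create-on-first-use pattern described for createDecoderIfNeeded can be sketched as follows. The types here (SketchTextDecoder, SketchDocument, SketchDocumentWriter) are hypothetical simplifications, and std::shared_ptr is used as a stand-in for WebKit's RefPtr; the actual constructor arguments and encoding setup are not reproduced.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins; not WebKit's actual classes.
struct SketchTextDecoder {
    std::string mimeType;
    std::string encoding;
};

struct SketchDocument {
    std::shared_ptr<SketchTextDecoder> m_decoder;  // mirrors Document::m_decoder
};

class SketchDocumentWriter {
public:
    SketchDocumentWriter(SketchDocument& document, std::string mimeType, std::string encoding)
        : m_document(document), m_mimeType(std::move(mimeType)), m_encoding(std::move(encoding)) {}

    // Create the decoder on first use, share it with the document, and return it.
    SketchTextDecoder* createDecoderIfNeeded() {
        if (!m_decoder) {
            m_decoder = std::make_shared<SketchTextDecoder>(SketchTextDecoder{m_mimeType, m_encoding});
            m_document.m_decoder = m_decoder;
        }
        assert(m_decoder);
        return m_decoder.get();
    }

private:
    SketchDocument& m_document;
    std::string m_mimeType;
    std::string m_encoding;
    std::shared_ptr<SketchTextDecoder> m_decoder;  // mirrors DocumentWriter::m_decoder
};
```

Calling createDecoderIfNeeded() twice returns the same object, and the document ends up holding a reference to that same decoder.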

DocumentWriter::begin()

Creates a Document object and a DocumentParser object.

DocumentWriter::addData()

DocumentWriter::end()

At this point decoding is complete: we are in HTMLDocumentParser, which holds the decoded input data.

Parsing

Background: tokens usually represent lexical units such as keywords, variable names, strings, literals, and punctuation like curly braces. A token is one meaningful unit, tokenizing is the process of producing tokens, and the tokenizer is the tool that carries out that process.

Call HTMLDocumentParser::finish();

Call HTMLDocumentParser::attemptToEnd();

Call HTMLDocumentParser::prepareToStopParsing();

Call HTMLDocumentParser::pumpTokenizerIfPossible();

Call HTMLDocumentParser::pumpTokenizer();

This is where page elements are actually parsed. A PumpSession is created first.

Then a while loop, guided by the state of the PumpSession, keeps fetching the next token until all elements have been parsed. Each token is obtained through HTMLTokenizer::nextToken.
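The shape of that loop can be sketched as below. This is only an outline under assumptions: SketchPumpSession models the session as a plain token budget, which is a simplification chosen for the example, and the tokenizer and tree builder are reduced to a queue of ready-made tokens and a log of names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

struct SketchToken { std::string name; };

// Hypothetical session object: here it only counts tokens against a budget.
struct SketchPumpSession {
    std::size_t processedTokens = 0;
    std::size_t tokenBudget;
    explicit SketchPumpSession(std::size_t budget) : tokenBudget(budget) {}
    bool shouldYield() const { return processedTokens >= tokenBudget; }
};

class SketchParser {
public:
    explicit SketchParser(std::deque<SketchToken> pending) : m_pending(std::move(pending)) {}

    // Pull tokens one at a time and hand each to the tree builder until the
    // input is exhausted or the session asks the loop to yield.
    void pumpTokenizer(SketchPumpSession& session) {
        assert(session.tokenBudget > 0);
        SketchToken token;
        while (!session.shouldYield() && nextToken(token)) {
            constructTreeFromToken(token);
            ++session.processedTokens;
        }
    }
    bool hasPendingInput() const { return !m_pending.empty(); }
    const std::vector<std::string>& built() const { return m_built; }

private:
    // Stands in for the tokenizer: returns false when nothing is left.
    bool nextToken(SketchToken& token) {
        if (m_pending.empty())
            return false;
        token = m_pending.front();
        m_pending.pop_front();
        return true;
    }
    // Stands in for the tree builder: just records the token name.
    void constructTreeFromToken(const SketchToken& token) { m_built.push_back(token.name); }

    std::deque<SketchToken> m_pending;
    std::vector<std::string> m_built;
};
```

A session with a small budget stops the loop early and leaves input pending; a later call with a fresh session resumes where the previous one stopped.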

Call HTMLTreeBuilder::constructTreeFromToken();

An AtomicHTMLToken is created from the HTMLToken passed in. AtomicHTMLToken has much the same members as HTMLToken, but some information in HTMLToken, such as m_data and m_attributes, is stored as ranges (start and end positions) in the input stream, whereas AtomicHTMLToken stores data that no longer depends on the input stream. During the conversion, HTMLToken::m_data is turned into a member with a specific meaning according to the token type. For a StartTag, for example, it becomes AtomicHTMLToken::m_name, because for that type HTMLToken::m_data holds the tag name.

Once the AtomicHTMLToken has been built, if the type of the HTMLToken is not Character, the HTMLToken is cleared so that its members are reset.
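The difference between a token that refers to the input and one that owns its data can be sketched as follows. The structs are hypothetical and follow the description above, in which the raw token records a start and end offset; the field names dataStart and dataEnd are assumptions, not WebKit members.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class SketchTokenType { Uninitialized, StartTag, EndTag, Character };

// Raw token: refers to the input stream by offsets.
struct SketchRawToken {
    SketchTokenType type = SketchTokenType::Uninitialized;
    std::size_t dataStart = 0;  // offset of the first character in the input
    std::size_t dataEnd = 0;    // one past the last character
    void clear() { *this = SketchRawToken(); }
};

// Atomic token: owns its strings and no longer depends on the input stream.
struct SketchAtomicToken {
    SketchTokenType type;
    std::string name;        // meaningful for StartTag and EndTag
    std::string characters;  // meaningful for Character

    SketchAtomicToken(const SketchRawToken& raw, const std::string& input) : type(raw.type) {
        assert(raw.dataStart <= raw.dataEnd && raw.dataEnd <= input.size());
        std::string data = input.substr(raw.dataStart, raw.dataEnd - raw.dataStart);
        // The same raw data gets a type-specific meaning in the atomic token.
        if (type == SketchTokenType::StartTag || type == SketchTokenType::EndTag)
            name = data;
        else if (type == SketchTokenType::Character)
            characters = data;
    }
};
```

After the atomic token is constructed, the raw token can be cleared and reused for the next call without affecting the copy handed to the tree builder.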

Call HTMLTreeBuilder::constructTreeFromAtomicToken();

The function handles the AtomicHTMLToken through HTMLTreeBuilder::processToken.

Call HTMLTreeBuilder::processToken();

This function dispatches on the type of the AtomicHTMLToken and calls the matching processXXX function. For a StartTag, for example, it enters HTMLTreeBuilder::processStartTag.

HTMLTreeBuilder has a member InsertionMode m_insertionMode, which stores the current insertion mode. What is the insertion mode? It is in effect a finite state machine: the builder changes state according to the incoming token and parses the token inside the state transition function. When a transition calls for building part of the DOM from the token, an HTMLConstructionSite::insertHTMLXXX function is called.

HTMLTreeBuilder also has a member HTMLConstructionSite m_tree. HTMLTreeBuilder itself recognizes tokens and maintains the state machine: it runs the state machine on each incoming token and uses the transition functions to decide what kind of node to create. The actual node creation is done by HTMLConstructionSite.

If the current token is a StartTag, the type dispatch leads to HTMLTreeBuilder::processStartTag. The state machine moves to BeforeHTMLMode, and in that state a StartTag token is handled by calling HTMLConstructionSite::insertHTMLHtmlStartTagBeforeHTML.
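The two-level dispatch, first by token type and then by insertion mode, can be sketched as below. This is a toy model: the set of modes is a small assumed subset, only the tags html, head, and meta are recognized, and node creation is reduced to recording a string where the real builder would call into HTMLConstructionSite.

```cpp
#include <string>
#include <vector>

enum class SketchTokenType { StartTag, EndTag, Character };
// A small, assumed subset of insertion modes.
enum class SketchMode { Initial, BeforeHTML, BeforeHead, InHead, AfterHead };

struct SketchAtomicToken {
    SketchTokenType type;
    std::string name;
};

class SketchTreeBuilder {
public:
    // First dispatch: by token type.
    void processToken(const SketchAtomicToken& token) {
        switch (token.type) {
        case SketchTokenType::StartTag: processStartTag(token); break;
        case SketchTokenType::EndTag: processEndTag(token); break;
        case SketchTokenType::Character: break;  // text is ignored in this sketch
        }
    }
    SketchMode mode() const { return m_mode; }
    const std::vector<std::string>& actions() const { return m_actions; }

private:
    // Second dispatch: by the current insertion mode.
    void processStartTag(const SketchAtomicToken& token) {
        if (m_mode == SketchMode::Initial)
            m_mode = SketchMode::BeforeHTML;
        switch (m_mode) {
        case SketchMode::BeforeHTML:
            if (token.name == "html") { insert("html"); m_mode = SketchMode::BeforeHead; }
            break;
        case SketchMode::BeforeHead:
            if (token.name == "head") { insert("head"); m_mode = SketchMode::InHead; }
            break;
        case SketchMode::InHead:
            if (token.name == "meta") insert("meta");
            break;
        default:
            break;
        }
    }
    void processEndTag(const SketchAtomicToken& token) {
        if (m_mode == SketchMode::InHead && token.name == "head")
            m_mode = SketchMode::AfterHead;
    }
    // Stands in for the node-creating calls made on the construction site.
    void insert(const std::string& tag) { m_actions.push_back("insert <" + tag + ">"); }

    SketchMode m_mode = SketchMode::Initial;
    std::vector<std::string> m_actions;
};
```

The same StartTag token leads to different actions depending on the mode, which is why the mode can be thought of as the context in which a token is interpreted.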

Node creation

HTMLConstructionSite::insertHTMLHtmlStartTagBeforeHTML

The function creates an HTMLHtmlElement through HTMLHtmlElement::create, passing HTMLConstructionSite::m_document as a parameter. Recall that Node has a member Document* m_document indicating which document the node belongs to, so the argument tells the new node which document owns it.

The inheritance chain of HTMLHtmlElement, from base class to most derived class:

Node

ContainerNode

Element

StyledElement

HTMLElement

HTMLHtmlElement

Node defines the members Document* m_document and RenderObject* m_renderer.

Element defines the members QualifiedName m_tagName and mutable RefPtr<NamedNodeMap> m_attributeMap.

These members are important. m_document identifies the document a node belongs to; each node belongs to exactly one document, and the Document is the root of the DOM tree.

m_renderer identifies the RenderObject that corresponds to the node; a rendered node corresponds to exactly one RenderObject.

QualifiedName m_tagName identifies the type of the element.

NamedNodeMap m_attributeMap holds the attributes of the element.

In addition, Node has members that link it into the tree structure.

With these members in mind, return to HTMLConstructionSite::insertHTMLHtmlStartTagBeforeHTML.

After the HTMLHtmlElement is created, the attributes in the AtomicHTMLToken are set on it.

The HTMLHtmlElement is then pushed onto an HTMLElementStack.

The function then ends. After this process, the node for the <html> tag has been created.

HTMLToken defines the following token types:

    enum Type {
        Uninitialized, // undefined, the default
        DOCTYPE,       // document type
        StartTag,      // start tag
        EndTag,        // end tag
        Comment,       // comment
        Character,     // element content (text)
        EndOfFile,     // end of document
    };

The walkthrough below parses this sample page:

    <html>
    <head>
    <meta http-equiv="Content-type" content="text/html;charset=utf-8"/>
    </head>
    <body>
    <p>test content</p>
    </body>
    </html>

Parsing of <html>

HTMLDocumentParser::m_token is passed into HTMLTokenizer::nextToken by reference, so both use the same HTMLToken.

Initially, HTMLTokenizer::m_state is DataState and the HTMLToken type is Uninitialized.

On reading '<', HTMLTokenizer::m_state becomes TagOpenState. The HTMLToken type is still Uninitialized.

On reading 'h', HTMLTokenizer::m_state becomes TagNameState. The HTMLToken type becomes StartTag, the attribute list and the current attribute are cleared, and the character 'h' is appended to HTMLToken::m_data.

On reading 't', HTMLTokenizer::m_state remains TagNameState. The HTMLToken type is StartTag, and the character 't' is appended to HTMLToken::m_data.

Reading 'm' and 'l' works the same way.

On reading '>', HTMLTokenizer::m_state returns to DataState. The HTMLToken type is StartTag. HTMLTokenizer::nextToken returns.

With that, HTMLTokenizer::nextToken has finished analyzing one "word": it has read "<html>" and produced one HTMLToken.

The HTMLToken is then handled by HTMLTreeBuilder::constructTreeFromToken.

HTMLTokenizer::nextToken runs inside HTMLDocumentParser::pumpTokenizer. HTMLDocumentParser has a member OwnPtr<HTMLTreeBuilder> m_treeBuilder, which is created in the HTMLDocumentParser constructor.

Recall that the Document class has a DocumentParser member, that HTMLDocument is a subclass of Document, and that HTMLDocumentParser is a subclass of DocumentParser. In other words, an HTMLDocument holds an HTMLDocumentParser, and the two correspond to each other. When the HTMLDocumentParser is created, the HTMLDocument passes itself in as a parameter. The HTMLDocumentParser constructor also creates an HTMLTreeBuilder, which receives the HTMLDocument pointer and records it in HTMLTreeBuilder::m_document. Another member, HTMLTreeBuilder::m_parser, records the HTMLDocumentParser pointer.
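The way these three objects point at each other can be sketched as follows. The Sketch* classes are hypothetical, std::unique_ptr stands in for WebKit's owning pointers, and the createParser function is an assumed name for whatever step creates the parser in the real code.

```cpp
#include <cassert>
#include <memory>

class SketchHTMLDocument;
class SketchHTMLDocumentParser;

// Hypothetical tree builder: remembers both the parser and the document.
class SketchHTMLTreeBuilder {
public:
    SketchHTMLTreeBuilder(SketchHTMLDocumentParser* parser, SketchHTMLDocument* document)
        : m_parser(parser), m_document(document) {
        assert(parser && document);
    }
    SketchHTMLDocumentParser* parser() const { return m_parser; }
    SketchHTMLDocument* document() const { return m_document; }

private:
    SketchHTMLDocumentParser* m_parser;
    SketchHTMLDocument* m_document;
};

// Hypothetical parser: creates its tree builder in the constructor.
class SketchHTMLDocumentParser {
public:
    explicit SketchHTMLDocumentParser(SketchHTMLDocument* document)
        : m_treeBuilder(std::make_unique<SketchHTMLTreeBuilder>(this, document)) {}
    SketchHTMLTreeBuilder& treeBuilder() { return *m_treeBuilder; }

private:
    std::unique_ptr<SketchHTMLTreeBuilder> m_treeBuilder;
};

// Hypothetical document: passes itself to the parser it creates.
class SketchHTMLDocument {
public:
    SketchHTMLDocumentParser& createParser() {
        m_parser = std::make_unique<SketchHTMLDocumentParser>(this);
        return *m_parser;
    }

private:
    std::unique_ptr<SketchHTMLDocumentParser> m_parser;
};
```

Ownership runs downward (document, parser, tree builder), while the raw back-pointers let the tree builder reach the document that every new node will belong to.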

Parsing of the line break after <html>

Back to HTMLDocumentParser::pumpTokenizer.

The while loop enters HTMLTokenizer::nextToken again.

At this point, HTMLTokenizer::m_state is DataState and the HTMLToken type is Uninitialized.

On reading the line feed, HTMLTokenizer::m_state stays DataState. The HTMLToken type becomes Character, and the line feed is appended to HTMLToken::m_data.

On reading '<', HTMLTokenizer::m_state is still DataState and the HTMLToken type is Character. HTMLTokenizer::nextToken returns.
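The state transitions traced so far, for a tag name and for a run of text, can be sketched as a small tokenizer. This is a deliberately reduced model: it has only three states, ignores attributes, comments, and DOCTYPE, and folds end-tag handling into the tag-open step. The class and enum names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

enum class SketchState { Data, TagOpen, TagName };
enum class SketchType { Uninitialized, StartTag, EndTag, Character };

struct SketchHTMLToken {
    SketchType type = SketchType::Uninitialized;
    std::string data;
    void clear() { type = SketchType::Uninitialized; data.clear(); }
};

class SketchTokenizer {
public:
    explicit SketchTokenizer(std::string input) : m_input(std::move(input)) {}

    // Fill `token` with the next token. Returns false when no complete token is left.
    bool nextToken(SketchHTMLToken& token) {
        assert(token.type == SketchType::Uninitialized);  // caller clears between calls
        while (m_pos < m_input.size()) {
            const char c = m_input[m_pos];
            switch (m_state) {
            case SketchState::Data:
                if (c == '<') {
                    if (token.type == SketchType::Character)
                        return true;  // emit pending text; '<' is read again next call
                    m_state = SketchState::TagOpen;
                } else {
                    token.type = SketchType::Character;
                    token.data += c;
                }
                break;
            case SketchState::TagOpen:
                if (c == '/') {
                    token.type = SketchType::EndTag;
                } else {
                    token.type = SketchType::StartTag;
                    token.data += c;
                }
                m_state = SketchState::TagName;
                break;
            case SketchState::TagName:
                if (c == '>') {
                    m_state = SketchState::Data;
                    ++m_pos;
                    return true;
                }
                token.data += c;
                break;
            }
            ++m_pos;
        }
        return token.type == SketchType::Character;  // trailing text, if any
    }

private:
    std::string m_input;
    std::size_t m_pos = 0;
    SketchState m_state = SketchState::Data;
};
```

Note how the text token is emitted without consuming the '<' that ends it, so the next call starts in the data state and sees that '<' first, matching the walkthrough above.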

In HTMLDocumentParser::pumpTokenizer, the token is passed to HTMLTreeBuilder::constructTreeFromToken, which leads to the following calls:

Call HTMLTreeBuilder::constructTreeFromAtomicToken

Call HTMLTreeBuilder::processToken

Call HTMLTreeBuilder::processCharacter

Call HTMLTreeBuilder::processCharacterBuffer

In HTMLTreeBuilder::processCharacterBuffer, m_insertionMode is found to be BeforeHeadMode. The string is examined, and because nothing remains after the whitespace characters, the function returns directly.
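The whitespace check can be sketched with two small helpers. Both function names are hypothetical, and the sketch only captures the idea that text consisting solely of whitespace produces no node in this mode.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Return what is left of `characters` after leading whitespace is skipped.
std::string skipLeadingWhitespace(const std::string& characters) {
    std::size_t i = 0;
    while (i < characters.size() && std::isspace(static_cast<unsigned char>(characters[i])))
        ++i;
    return characters.substr(i);
}

// In a "before head" style mode, whitespace-only text needs no node.
bool characterBufferNeedsNode(const std::string& characters) {
    return !skipLeadingWhitespace(characters).empty();
}
```

For the single line feed after <html>, nothing remains once whitespace is skipped, so the helper reports that no node is needed.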

Parsing of <head>

Back to HTMLDocumentParser::pumpTokenizer.

At this point, HTMLTokenizer::m_state is DataState and the HTMLToken type is Uninitialized.

The state machine handles this tag much as it handled <html>.

The name of the token generated here is headTag. An HTMLElement is created from the token.

HTMLConstructionSite::attachToCurrent is then called. What is "current"? It is the top element of the HTMLConstructionSite::m_openElements stack. Recall that when the HTMLHtmlElement was created it was pushed onto this stack, so the top of the stack is now that HTMLHtmlElement. Attaching links the two nodes as parent and child.

The new HTMLElement is then pushed onto the stack.

Thus, each time a start tag is received and a new HTMLElement is created, the new element is pushed onto the stack, so that the parent of the next node to be inserted (the top of the stack) can be found through the stack. In other words, a new element becomes a child of the most recent unclosed start tag. It follows that when an end tag is received the stack must be popped, so that the top of the stack again identifies the parent of the next new node. The stack therefore tracks the opening and closing of tags, much like a function call stack: entering a function pushes it, calling a nested function pushes that one, and returning pops it. Just as the call stack identifies how deeply a function is nested, the tag stack identifies how deeply a tag is nested.
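The stack discipline described above can be sketched as follows. The classes are hypothetical simplifications: SketchNode is a bare tree node, the stack starts with a placeholder document node, and attributes and error recovery for mismatched tags are left out.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct SketchNode {
    std::string tagName;
    SketchNode* parent = nullptr;
    std::vector<std::unique_ptr<SketchNode>> children;
    explicit SketchNode(std::string name) : tagName(std::move(name)) {}
};

class SketchConstructionSite {
public:
    SketchConstructionSite() : m_root(new SketchNode("#document")) {
        m_open.push_back(m_root.get());
    }

    // Start tag: create a node, attach it to the current node, then push it.
    SketchNode* insertElement(const std::string& name) {
        SketchNode* node = attachToCurrent(name);
        m_open.push_back(node);
        return node;
    }
    // Self-closing tag such as <meta/>: attach it, but do not push it.
    SketchNode* insertSelfClosingElement(const std::string& name) {
        return attachToCurrent(name);
    }
    // End tag: pop the matching open element.
    void popElement(const std::string& name) {
        assert(m_open.size() > 1 && m_open.back()->tagName == name);
        m_open.pop_back();
    }
    SketchNode* currentNode() const { return m_open.back(); }

private:
    // The parent is always the element on top of the stack.
    SketchNode* attachToCurrent(const std::string& name) {
        auto child = std::make_unique<SketchNode>(name);
        child->parent = currentNode();
        SketchNode* raw = child.get();
        currentNode()->children.push_back(std::move(child));
        return raw;
    }

    std::unique_ptr<SketchNode> m_root;
    std::vector<SketchNode*> m_open;  // plays the role of m_openElements
};
```

With this model, a <body> inserted after </head> becomes a sibling of <head> rather than its child, because the pop restored <html> as the top of the stack.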

Parsing of <meta http-equiv="Content-type" content="text/html;charset=utf-8"/>

Back to HTMLDocumentParser::pumpTokenizer.

While <meta is being handled, HTMLTokenizer::m_state is TagNameState, as with the earlier tags.

On reading a space, HTMLTokenizer::m_state becomes BeforeAttributeNameState. The HTMLToken type is StartTag.

On reading 'h', HTMLTokenizer::m_state is BeforeAttributeNameState and the HTMLToken type is StartTag. HTMLToken::addNewAttribute makes HTMLToken::m_attributes allocate space for a new attribute. HTMLToken has a member Attribute* m_currentAttribute that points to the current attribute, so it now points at the newly allocated attribute. HTMLToken::beginAttributeName sets the start of the attribute name's range, and HTMLToken::appendToAttributeName appends the character 'h'. HTMLToken has several functions dedicated to maintaining attribute information. While HTMLTokenizer is parsing attributes, the type of the token does not change (it is still StartTag); only the state of the state machine changes, here to BeforeAttributeNameState, and the state handlers fill in the attributes of the HTMLToken as they parse the data.

On reading 't', HTMLTokenizer::m_state becomes AttributeNameState. The HTMLToken type is StartTag. The character 't' is appended to the current attribute name of the HTMLToken.

Reading "tp-equiv" follows the same flow as reading 't' above.

On reading '=', HTMLTokenizer::m_state becomes BeforeAttributeValueState. The HTMLToken type is StartTag. HTMLToken::endAttributeName sets the end offset of the attribute name, as well as the start and end offsets of the attribute value. The value has not started yet; the offsets are probably set here to cover attributes that have no value in the HTML page.

On reading the opening double quote, HTMLTokenizer::m_state becomes AttributeValueDoubleQuotedState. The HTMLToken type is StartTag. HTMLToken::beginAttributeValue sets the start offset of the attribute value.

On reading 'C', HTMLTokenizer::m_state stays AttributeValueDoubleQuotedState. The HTMLToken type is StartTag. The character 'C' is appended to the current attribute value of the HTMLToken.

Reading "ontent-type" follows the same flow as reading 'C' above.

On reading the closing double quote, HTMLTokenizer::m_state becomes AfterAttributeValueQuotedState. The HTMLToken type is StartTag. HTMLToken::endAttributeValue sets the end offset of the attribute value.

On reading a space, HTMLTokenizer::m_state becomes BeforeAttributeNameState. The HTMLToken type is StartTag.

On reading 'c', HTMLTokenizer::m_state becomes AttributeNameState. The HTMLToken type is StartTag. The same attribute-creation steps run as when 'h' was read above.

Reading ontent="text/html;charset=utf-8" follows the same flow as the corresponding steps above.

On reading a space, HTMLTokenizer::m_state becomes BeforeAttributeNameState. The HTMLToken type is StartTag.

On reading '/', HTMLTokenizer::m_state becomes SelfClosingStartTagState. The HTMLToken type is StartTag.

On reading '>', the self-closing flag is set through HTMLToken::setSelfClosing, and HTMLTokenizer::nextToken returns.
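The attribute states walked through above can be sketched as a single function that processes the part of a start tag following the tag name. This is a reduced model under assumptions: only double-quoted values are handled, attribute names are not lower-cased, a space after an attribute name goes straight back to the before-name state, and values are copied into strings instead of being recorded as offsets. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct SketchAttribute {
    std::string name;
    std::string value;
};

struct SketchStartTag {
    std::vector<SketchAttribute> attributes;
    bool selfClosing = false;
};

enum class AttrState {
    BeforeAttributeName, AttributeName, BeforeAttributeValue,
    AttributeValueDoubleQuoted, AfterAttributeValueQuoted, SelfClosingStartTag
};

// Tokenize the text that follows a tag name, up to and including '>'.
// Returns true when '>' was reached, false on input this sketch cannot handle.
bool tokenizeAttributes(const std::string& rest, SketchStartTag& tag) {
    AttrState state = AttrState::BeforeAttributeName;
    for (char c : rest) {
        switch (state) {
        case AttrState::BeforeAttributeName:
            if (c == ' ') break;
            if (c == '/') { state = AttrState::SelfClosingStartTag; break; }
            if (c == '>') return true;
            // Start a new attribute and record its first character.
            tag.attributes.push_back(SketchAttribute{std::string(1, c), std::string()});
            state = AttrState::AttributeName;
            break;
        case AttrState::AttributeName:
            assert(!tag.attributes.empty());
            if (c == '=') state = AttrState::BeforeAttributeValue;
            else if (c == ' ') state = AttrState::BeforeAttributeName;
            else if (c == '>') return true;
            else tag.attributes.back().name += c;
            break;
        case AttrState::BeforeAttributeValue:
            if (c == '"') state = AttrState::AttributeValueDoubleQuoted;
            else if (c != ' ') return false;  // unquoted values are not handled
            break;
        case AttrState::AttributeValueDoubleQuoted:
            if (c == '"') state = AttrState::AfterAttributeValueQuoted;
            else tag.attributes.back().value += c;
            break;
        case AttrState::AfterAttributeValueQuoted:
            if (c == ' ') state = AttrState::BeforeAttributeName;
            else if (c == '/') state = AttrState::SelfClosingStartTag;
            else if (c == '>') return true;
            else return false;
            break;
        case AttrState::SelfClosingStartTag:
            if (c == '>') { tag.selfClosing = true; return true; }
            return false;
        }
    }
    return false;  // ran out of input before '>'
}
```

Applied to the attribute part of the sample <meta> tag, the function yields two attributes and sets the self-closing flag, while the token type itself never changes during the process.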

The name of the token generated here is metaTag. An HTMLElement is created from the token, its attributes are set, and it is attached to the HTMLElement created for the preceding <head>.

Handling of </head>

Back to HTMLDocumentParser::pumpTokenizer.

With the groundwork above, this tag is easy to follow; tokenizing it is very similar to the tags handled earlier.

When HTMLTreeBuilder handles this HTMLToken, it pops the HTMLConstructionSite::m_openElements stack, as anticipated earlier, and updates the state of the HTMLTreeBuilder state machine. No new node is created here.

The remaining tags are handled similarly, with the HTMLElement stack changing accordingly as each tag is processed.

The token handling above shows that HTMLTokenizer::nextToken performs lexical analysis with a state machine and isolates one HTMLToken at a time. HTMLTreeBuilder::constructTreeFromToken then parses each token to decide, in the current state, what kind of node to create.

HTMLTreeBuilder drives this stage of parsing. It first dispatches on the type of the HTMLToken (through processXXX); within each branch it then consults the state machine it maintains and dispatches further. In other words, HTMLTreeBuilder considers both its current state and the type of the HTMLToken; its state acts as the context in which the HTMLToken appears.

HTMLTreeBuilder ultimately creates the corresponding node through HTMLConstructionSite (through insertXXX).

Note also that HTMLConstructionSite has a member mutable HTMLElementStack m_openElements, a stack of HTMLElements that tracks the opening and closing of tags. It maintains the tag hierarchy, which determines under which node a new node is inserted, that is, the parent-child relationships.

All nodes created here share one Document: the Document* m_document held by the HTMLTreeBuilder that belongs to the HTMLDocumentParser of the HTMLDocument. That document is the HTMLDocument we started with.

Thus, all nodes created afterwards are descendants of the HTMLDocument.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
