First, an introduction to the DOM:
DOM is the abbreviation for Document Object Model. The DOM defines a set of language- and platform-agnostic interfaces that allow programs to access and modify a document. Within CEF, an HTML document is parsed into a tree structure, the DOM tree. The following figure shows an HTML document and its corresponding DOM tree.
Second, the main classes that generate the DOM tree and their relationship diagram:
Third, the DOM tree creation process:
First, HTMLDocumentParser receives the HTML-formatted string and hands it to HTMLTokenizer to be split into tokens; the resulting HTMLToken objects are then passed to HTMLTreeBuilder to build the DOM tree.
1. Token segmentation
Internally, HTMLTokenizer splits the input into HTMLToken objects using a complex state machine with up to 70 states; for details, see the definition of enum State in HTMLTokenizer.h.
The main interface for splitting HTML into tokens:
bool HTMLTokenizer::nextToken(SegmentedString& source, HTMLToken& token);
source is the incoming HTML string, and token is the HTMLToken produced by the split.
Take the following HTML string as an example to analyze the nextToken segmentation process:
The first parameter of nextToken, source, is the HTML above. HTMLDocumentParser::pumpTokenizer calls nextToken repeatedly in a while loop until the entire HTML string has been split. After each call to nextToken, the string pointer inside source has advanced to the character immediately after the current token, so the next call to nextToken parses the next token.
The following figure shows in detail how the first token is split; subsequent tokens are split in a similar way. m_state is the current state of the state machine and cc is the character currently being parsed. On entry to nextToken, the string held by source is:
In the figure, the red part parses the tag name, which is stored in the m_data member of the HTMLToken object token. The orange part parses an attribute name and the blue part parses an attribute value. Each parsed attribute is stored in an HTMLToken::Attribute object, and HTMLToken keeps all of its HTMLToken::Attribute objects in an attribute list (m_attributes).
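To make the state-machine idea concrete, here is a minimal sketch of a tokenizer that walks a string one character at a time. It is not the real HTMLTokenizer: SketchToken, SketchState, nextTokenSketch and pumpTokenizerSketch are hypothetical names, only seven states are modelled, and only double-quoted attribute values are handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for HTMLToken (hypothetical, not the real class).
struct SketchToken {
    enum Type { Uninitialized, StartTag, EndTag, Character };
    Type type = Uninitialized;
    std::string data;  // tag name or text, playing the role of m_data
    std::vector<std::pair<std::string, std::string>> attributes;  // like m_attributes
};

// A handful of illustrative states; the real enum State has far more.
enum SketchState {
    DataState, TagOpenState, TagNameState, BeforeAttributeNameState,
    AttributeNameState, BeforeAttributeValueState, AttributeValueDoubleQuotedState
};

static bool isSpace(char cc) { return std::isspace(static_cast<unsigned char>(cc)) != 0; }

// Consumes characters from source, starting at pos, until one token is complete.
// Returns false once the input is exhausted without producing a token.
bool nextTokenSketch(const std::string& source, std::size_t& pos, SketchToken& token) {
    token = SketchToken();
    SketchState state = DataState;
    while (pos < source.size()) {
        const char cc = source[pos++];
        switch (state) {
        case DataState:
            if (cc != '<') { token.type = SketchToken::Character; token.data += cc; break; }
            if (token.type == SketchToken::Character) { --pos; return true; }  // emit text first
            state = TagOpenState;
            break;
        case TagOpenState:
            if (cc == '/') token.type = SketchToken::EndTag;
            else { token.type = SketchToken::StartTag; token.data += cc; }
            state = TagNameState;
            break;
        case TagNameState:
            if (cc == '>') return true;
            if (isSpace(cc)) state = BeforeAttributeNameState;
            else token.data += cc;
            break;
        case BeforeAttributeNameState:
            if (cc == '>') return true;
            if (isSpace(cc)) break;
            token.attributes.emplace_back(std::string(1, cc), std::string());
            state = AttributeNameState;
            break;
        case AttributeNameState:
            assert(!token.attributes.empty());
            if (cc == '>') return true;
            if (cc == '=') state = BeforeAttributeValueState;
            else if (isSpace(cc)) state = BeforeAttributeNameState;
            else token.attributes.back().first += cc;
            break;
        case BeforeAttributeValueState:
            if (cc == '>') return true;
            if (cc == '"') state = AttributeValueDoubleQuotedState;
            break;
        case AttributeValueDoubleQuotedState:
            if (cc == '"') state = BeforeAttributeNameState;
            else token.attributes.back().second += cc;
            break;
        }
    }
    return token.type == SketchToken::Character;  // trailing text, if any
}

// Mirrors the while loop described for pumpTokenizer: call until nothing is left.
std::vector<SketchToken> pumpTokenizerSketch(const std::string& html) {
    std::vector<SketchToken> tokens;
    std::size_t pos = 0;
    SketchToken token;
    while (nextTokenSketch(html, pos, token)) tokens.push_back(token);
    return tokens;
}
```

Here pos plays the role of the string pointer inside source: each call leaves it just past the token it returned, so the next call starts on the next token.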
2. Token handling
Once an HTMLToken object is available, HTMLDocumentParser immediately calls constructTreeFromHTMLToken to build the DOM tree. After several layers of function calls, control reaches HTMLTreeBuilder::processToken, the core function for handling tokens. processToken calls the processXXX function that matches the token type (processDoctypeToken, processStartTag, processEndTag, processComment, processEndOfFile) to handle the token. The token types are as follows:
enum Type {
    Uninitialized, // uninitialized
    DOCTYPE,       // document type declaration
    StartTag,      // start tag
    EndTag,        // end tag
    Comment,       // comment
    Character,     // character
    EndOfFile,     // end of file
};
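The dispatch on token type can be sketched as a switch statement. This is an illustration only: TreeBuilderSketch and SketchToken are hypothetical, each handler just records that it ran, and the Character case is routed to a handler named processCharacter on the assumption that character tokens get their own function.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum TokenType { Uninitialized, DOCTYPE, StartTag, EndTag, Comment, Character, EndOfFile };

// Hypothetical stand-in for HTMLToken: just a type and a name.
struct SketchToken {
    TokenType type;
    std::string name;
};

// Hypothetical builder whose handlers only log which function was chosen.
class TreeBuilderSketch {
public:
    void processToken(const SketchToken& token) {
        assert(token.type != Uninitialized);  // only fully parsed tokens reach the builder
        switch (token.type) {
        case DOCTYPE:       processDoctypeToken(token); break;
        case StartTag:      processStartTag(token); break;
        case EndTag:        processEndTag(token); break;
        case Comment:       processComment(token); break;
        case Character:     processCharacter(token); break;  // assumed handler name
        case EndOfFile:     processEndOfFile(token); break;
        case Uninitialized: break;
        }
    }

    const std::vector<std::string>& log() const { return m_log; }

private:
    void record(const char* handler, const SketchToken& token) {
        m_log.push_back(std::string(handler) + ":" + token.name);
    }
    void processDoctypeToken(const SketchToken& t) { record("processDoctypeToken", t); }
    void processStartTag(const SketchToken& t) { record("processStartTag", t); }
    void processEndTag(const SketchToken& t) { record("processEndTag", t); }
    void processComment(const SketchToken& t) { record("processComment", t); }
    void processCharacter(const SketchToken& t) { record("processCharacter", t); }
    void processEndOfFile(const SketchToken& t) { record("processEndOfFile", t); }

    std::vector<std::string> m_log;
};
```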
These five processXXX functions also handle tokens internally through a state machine, whose states are defined as follows:
enum InsertionMode {
    InitialMode,
    BeforeHTMLMode,
    BeforeHeadMode,
    InHeadMode,
    InHeadNoscriptMode,
    AfterHeadMode,
    TemplateContentsMode,
    InBodyMode,
    TextMode,
    InTableMode,
    InTableTextMode,
    InCaptionMode,
    InColumnGroupMode,
    InTableBodyMode,
    InRowMode,
    InCellMode,
    InSelectMode,
    InSelectInTableMode,
    AfterBodyMode,
    InFramesetMode,
    AfterFramesetMode,
    AfterAfterBodyMode,
    AfterAfterFramesetMode,
};
HTMLTreeBuilder stores its state in the variable m_insertionMode, whose initial value is InitialMode. Taking the HTML given above as an example, m_insertionMode changes as follows while the DOM tree is created:
Below, m_insertionMode is abbreviated as state. The first token (type=StartTag, name=html) enters processToken. The state == InitialMode branch runs first and sets state to BeforeHTMLMode; the state == BeforeHTMLMode branch then runs, the token is converted into an HTMLHtmlElement object, state becomes BeforeHeadMode, and the function returns.
processToken is then called again for the second token (type=StartTag, name=head). The state == BeforeHeadMode branch runs, an HTMLHeadElement object is created, state becomes InHeadMode, and the function returns.
Next, the token (type=EndTag, name=head) is processed: the state == InHeadMode branch runs and state becomes AfterHeadMode.
Next, the token (type=StartTag, name=body) is processed: the state == AfterHeadMode branch runs, an HTMLBodyElement object is created, and state becomes InBodyMode.
Next, the token (type=StartTag, name=a) is processed: the state == InBodyMode branch runs, an HTMLDivElement object is created, and state is unchanged.
Next, the token (type=EndTag, name=a) is processed: the state == InBodyMode branch runs and state is unchanged.
Next, the token (type=EndTag, name=body) is processed: the state == InBodyMode branch runs and state becomes AfterBodyMode.
Next, the token (type=EndTag, name=html) is processed: the state == AfterBodyMode branch runs and state becomes AfterAfterBodyMode.
Finally, the token (type=EndTag, name=html) is processed: the state == AfterAfterBodyMode branch runs, state becomes InBodyMode, and the state == InBodyMode branch runs.
At this point, the entire state machine process is finished.
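The walkthrough above can be condensed into a small transition function. The sketch below is a simplification under stated assumptions: it models only the transitions a minimal well-formed document exercises, up to AfterAfterBodyMode; any other token leaves the mode unchanged, which the real tree builder does not do; and nextMode, traceModes and SketchToken are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Subset of the insertion modes, enough for a minimal document.
enum InsertionMode {
    InitialMode, BeforeHTMLMode, BeforeHeadMode, InHeadMode,
    AfterHeadMode, InBodyMode, AfterBodyMode, AfterAfterBodyMode
};

enum TokenType { StartTag, EndTag };

struct SketchToken {
    TokenType type;
    std::string name;
};

// Returns the mode the builder is in after handling one token.
InsertionMode nextMode(InsertionMode mode, const SketchToken& token) {
    const bool isStart = token.type == StartTag;
    // InitialMode switches to BeforeHTMLMode and handles the same token again.
    if (mode == InitialMode)
        mode = BeforeHTMLMode;
    switch (mode) {
    case BeforeHTMLMode: return (isStart && token.name == "html") ? BeforeHeadMode : mode;
    case BeforeHeadMode: return (isStart && token.name == "head") ? InHeadMode : mode;
    case InHeadMode:     return (!isStart && token.name == "head") ? AfterHeadMode : mode;
    case AfterHeadMode:  return (isStart && token.name == "body") ? InBodyMode : mode;
    case InBodyMode:     return (!isStart && token.name == "body") ? AfterBodyMode : mode;
    case AfterBodyMode:  return (!isStart && token.name == "html") ? AfterAfterBodyMode : mode;
    default:             return mode;  // unmodelled modes and tokens: no change
    }
}

// Feeds the tokens through nextMode and records the mode after each one.
std::vector<InsertionMode> traceModes(const std::vector<SketchToken>& tokens) {
    std::vector<InsertionMode> modes;
    InsertionMode mode = InitialMode;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        mode = nextMode(mode, tokens[i]);
        assert(mode != InitialMode);  // InitialMode is always left on the first token
        modes.push_back(mode);
    }
    return modes;
}
```

Calling traceModes on the start and end tags of html, head, body and one element inside body yields one mode per token, which can be compared against the steps described above.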
3. DOM tree construction
The DOM tree is assembled from the HTMLXXXElement objects that the tokens were converted into during the state changes above. The assembly uses a stack (last in, first out): HTML documents are nested and highly symmetric, which fits the characteristics of a stack well. The HTMLElementStack object is this stack. Its member m_top points to the top of the stack and has type ElementRecord. ElementRecord has two members: m_item (type HTMLStackItem), which stores the element, and m_next (type ElementRecord), which points to the next record in the stack and links the whole stack together. This is also shown in the class diagram in the second section.
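The linked-record layout described above can be sketched as follows. This is an illustration rather than the real HTMLElementStack: ElementStackSketch and StackItemSketch are hypothetical, the item holds only a tag name, and ownership is modelled with std::unique_ptr as an assumption about how such a stack could be written in standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for HTMLStackItem: stores just the tag name.
struct StackItemSketch {
    std::string tagName;
};

class ElementStackSketch {
public:
    // Each record stores one item and a link to the record below it.
    struct ElementRecord {
        StackItemSketch m_item;
        std::unique_ptr<ElementRecord> m_next;
    };

    // The new record becomes the top; the old top becomes its m_next.
    void push(const std::string& tagName) {
        std::unique_ptr<ElementRecord> record(
            new ElementRecord{StackItemSketch{tagName}, std::move(m_top)});
        m_top = std::move(record);
    }

    // Removes the top record; the record below it becomes the new top.
    void pop() {
        assert(m_top);
        std::unique_ptr<ElementRecord> rest = std::move(m_top->m_next);
        m_top = std::move(rest);
    }

    const StackItemSketch& top() const {
        assert(m_top);
        return m_top->m_item;
    }

    bool isEmpty() const { return !m_top; }

    // Walks the m_next links from the top to count the records.
    std::size_t depth() const {
        std::size_t count = 0;
        for (const ElementRecord* r = m_top.get(); r; r = r->m_next.get())
            ++count;
        return count;
    }

private:
    std::unique_ptr<ElementRecord> m_top;
};
```

Because every record links only to the one beneath it, push and pop touch nothing but m_top, which is what makes the structure a natural fit for nested start and end tags.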
As mentioned earlier, processToken calls a different processXXX function depending on the token type. When type=StartTag, it calls processStartTag, which creates the corresponding HTMLXXXElement object according to the token name and then calls attachLater to create a deferred task. The task type is Insert (insert a node); task.parent (the parent node) is set to the element stored at the top of the stack, and task.child is set to the element currently being inserted into the DOM tree. The element is then wrapped in an ElementRecord object and pushed onto the stack. When type=EndTag, it calls processEndTag, which removes the ElementRecord corresponding to the token name from the stack.
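The interplay between the stack and the deferred insert tasks can be sketched as below. Several things here are assumptions made for illustration: BuilderSketch, ElementSketch and InsertTask are hypothetical types, the stack is a std::vector seeded with a root node, elements are owned by the caller, and an end tag with no matching open element is simply ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical element: only the tag name matters for this sketch.
struct ElementSketch {
    std::string tagName;
};

// Deferred insert task: attach child under parent at a later point.
struct InsertTask {
    ElementSketch* parent;
    ElementSketch* child;
};

class BuilderSketch {
public:
    explicit BuilderSketch(ElementSketch* root) {
        assert(root);
        m_openElements.push_back(root);
    }

    // Start tag: queue an insert under the current top of the stack,
    // then push the new element so that it becomes the parent of what follows.
    void processStartTag(ElementSketch* element) {
        m_taskQueue.push_back(InsertTask{m_openElements.back(), element});
        m_openElements.push_back(element);
    }

    // End tag: remove the matching open element and everything above it.
    // The root at index 0 is never removed.
    void processEndTag(const std::string& tagName) {
        for (std::size_t i = m_openElements.size(); i > 1; --i) {
            if (m_openElements[i - 1]->tagName == tagName) {
                m_openElements.resize(i - 1);
                return;
            }
        }
    }

    ElementSketch* currentNode() const { return m_openElements.back(); }
    const std::vector<InsertTask>& queuedTasks() const { return m_taskQueue; }

private:
    std::vector<ElementSketch*> m_openElements;  // back() is the top of the stack
    std::vector<InsertTask> m_taskQueue;
};
```

Note that head and body both end up with html as their parent: by the time the body start tag arrives, the head end tag has already popped head off the stack.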
After processToken returns, executeQueuedTasks is called to process the tasks generated in the previous step. After several layers of calls, it finally calls task.parent->appendChildCommon to insert the element into the DOM tree. That function is defined as follows:
void ContainerNode::appendChildCommon(Node& child)
{
    child.setParentOrShadowHostNode(this);
    if (m_lastChild) {
        child.setPreviousSibling(m_lastChild);
        m_lastChild->setNextSibling(&child);
    } else {
        setFirstChild(&child);
    }
    setLastChild(&child);
}
When m_lastChild == null (that is, when the parent node receives its first child), m_firstChild is set to child and m_lastChild is set to child. When m_lastChild != null, child.m_previous is set to m_lastChild, m_lastChild.m_next is set to child, and m_lastChild is updated to child.
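The same pointer updates can be tried out in isolation with a self-contained node type. NodeSketch and childNames are hypothetical names introduced for this sketch; the plain public pointers stand in for the setter calls in the function above, and the precondition that a child is not already attached is an added assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical node with the five links that the append logic touches.
struct NodeSketch {
    explicit NodeSketch(const std::string& n) : name(n) {}

    std::string name;
    NodeSketch* parent = nullptr;
    NodeSketch* firstChild = nullptr;
    NodeSketch* lastChild = nullptr;
    NodeSketch* previousSibling = nullptr;
    NodeSketch* nextSibling = nullptr;

    // Appends child as the last child of this node.
    void appendChild(NodeSketch& child) {
        assert(!child.parent);  // assumed precondition: child is not attached yet
        child.parent = this;
        if (lastChild) {
            // Link the new child after the current last child.
            child.previousSibling = lastChild;
            lastChild->nextSibling = &child;
        } else {
            // First child of this parent.
            firstChild = &child;
        }
        lastChild = &child;
    }
};

// Lists the children of parent in order, separated by spaces.
std::string childNames(const NodeSketch& parent) {
    std::string names;
    for (const NodeSketch* c = parent.firstChild; c; c = c->nextSibling) {
        if (!names.empty())
            names += ' ';
        names += c->name;
    }
    return names;
}
```

Appending html to a document node, then head and body to html, then a div to body reproduces the four insertions walked through in this section.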
DOM tree build process diagrams:
After inserting the first node (the HTMLHtmlElement object), the DOM tree is as follows, where HTMLDocument is the root node of the entire document:
After inserting the second node (the HTMLHeadElement object), the DOM tree is as follows:
After inserting the third node (the HTMLBodyElement object), the DOM tree is as follows:
After inserting the fourth node (the HTMLDivElement object), the DOM tree is as follows:
This completes the entire DOM tree creation process.