How to validate DTD
A DTD defines the structure of an XML document in two main ways: it declares which child nodes each element may contain, and it declares each element's attributes. An XML parser that validates against a DTD must therefore check both. This post covers the first: validating child node types.
A DTD element declaration can define child nodes in the following ways:
<!ELEMENT A ANY>
Node A can contain any node type. This is the simplest case.
<!ELEMENT A (#PCDATA)>
Node A can contain only text.
<!ELEMENT A (B, C)>
Node A must contain exactly one B node followed by exactly one C node, in that order.
<!ELEMENT A (B*, C)>
Node A contains any number of B nodes (including none), followed by exactly one C node.
<!ELEMENT A (B?, C+)*>
This case is more complicated. Ignoring the outer * for now, B? means there can be at most one B node, and C+ means it is followed by at least one C node.
Now consider the outermost *. It lets the whole group repeat 0, 1, 2, ... times. Enumerating some of the possibilities gives:
(empty), C, CC, CCC, ..., BC, BCC, BCCC, ..., CBC, BCBC, BCCBC, BCBCBC, ...
A sequence such as BBC does not match, because every B must be followed by at least one C before the next B can appear.
Validating child node types by hand quickly gets complicated once *, ?, and + are combined like this. But notice that the declaration syntax is itself essentially a regular expression. So can we use regular expression matching to validate the child node types? The answer is yes.
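To make the idea concrete, here is a minimal sketch that checks child sequences against the content model (B?, C+)* using std::regex from the C++ standard library. The function names matchesContentModel and checkExamples are made up for this illustration and are not taken from the parser's source. The sketch also assumes every element name is a single character.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true when `children` (child element names written one after
// another, e.g. "BCC") is allowed by the content model (B?, C+)*.
// Assumption: every element name is a single character.
bool matchesContentModel(const std::string& children) {
    static const std::regex model("(B?C+)*");
    return std::regex_match(children, model);  // the whole string must match
}

// Spot-checks a few sequences against the content model.
void checkExamples() {
    assert(matchesContentModel(""));      // the group repeated zero times
    assert(matchesContentModel("C"));
    assert(matchesContentModel("BCC"));
    assert(matchesContentModel("BCBC"));
    assert(!matchesContentModel("B"));    // a B needs at least one C after it
    assert(!matchesContentModel("BBC"));  // two B nodes in a row
}
```

Note that std::regex_match requires the entire string to match, which is what validation needs; std::regex_search would also accept strings that merely contain a valid run.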
Let's look at an example. For <!ELEMENT A (B?, C+)*>, the XML parser builds an in-memory data structure out of two types:
DTDElementDecl
DTDElementDeclNode
DTDElementDeclNode represents one declared child node (or a group of them), while DTDElementDecl represents a complete element declaration.
The declaration above produces the following structure:
DTDElementDeclNode B;
DTDElementDeclNode C;
B.setName("B");
B.setCountType(enumOneOrZero);    // B?
C.setName("C");
C.setCountType(enumOneOrMore);    // C+
DTDElementDeclNode D;             // the group (B?, C+)
D.setCountType(enumZeroOrMore);   // the outer *
D.addChild(B);
D.addChild(C);
DTDElementDecl A;
A.addChild(D);
So the DTDElementDecl has a single child node, D, which may occur any number of times. D in turn contains its own two child nodes, and the nesting can continue to any depth.
To validate with a regular expression, first translate this hierarchy into one. The structure above becomes (B?C+)*, which is very simple. The matching itself can then be left to an existing regular expression engine (such as Boost.Regex).
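The translation step can be sketched as a short recursive function. The DeclNode struct, the Count enum, and toRegex below are hypothetical stand-ins written for this example, not the parser's actual classes. The sketch handles only sequences (the comma operator), not choices written with |, and again assumes single-character element names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// How many times a node or group may occur.
enum class Count { One, ZeroOrOne, ZeroOrMore, OneOrMore };

// Hypothetical declaration node: a named leaf, or a group of children.
struct DeclNode {
    std::string name;                // empty for a group such as (B?, C+)
    Count count = Count::One;
    std::vector<DeclNode> children;  // used only by groups
};

// Recursively turns a declaration node into a regular expression string.
std::string toRegex(const DeclNode& node) {
    std::string out;
    if (node.children.empty()) {
        assert(!node.name.empty());  // a leaf must carry an element name
        out = node.name;
    } else {
        out = "(";
        for (const DeclNode& child : node.children) {
            out += toRegex(child);   // a sequence is plain concatenation
        }
        out += ")";
    }
    switch (node.count) {
        case Count::One:        break;
        case Count::ZeroOrOne:  out += "?"; break;
        case Count::ZeroOrMore: out += "*"; break;
        case Count::OneOrMore:  out += "+"; break;
    }
    return out;
}
```

Building a group with count ZeroOrMore that contains a leaf B (ZeroOrOne) and a leaf C (OneOrMore) and passing it to toRegex gives the string (B?C+)*. With longer element names, each name would need its own group and a separator so that the quantifier applies to the whole name.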
With the regular expression in hand, you can validate the XML document. However, regular expression engines only match strings, so the child nodes of the corresponding element in the XML document must also be converted into a string. For example:
<A>
<B/>
<C/>
<C/>
<C/>
<B/>
<C/>
</A>
From the earlier analysis, we can see that this document conforms to <!ELEMENT A (B?, C+)*>. Converted into a string, the children of A become:
BCCCBC
The final step is to match this string against (B?C+)*.
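The last two steps, flattening the child nodes and running the match, might look like the sketch below. It uses std::regex from the standard library rather than Boost.Regex, and it takes the child names as a plain vector instead of reading them from a parsed document. flattenChildren and childrenMatch are illustrative names, not functions from the parser.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Joins the child element names in document order, e.g. {"B","C"} -> "BC".
// Assumption: names are single characters, so no separator is needed.
std::string flattenChildren(const std::vector<std::string>& names) {
    std::string out;
    for (const std::string& name : names) {
        assert(!name.empty());  // an element name is never empty
        out += name;
    }
    return out;
}

// Returns true when the flattened children match the content model regex.
bool childrenMatch(const std::vector<std::string>& names,
                   const std::string& modelRegex) {
    return std::regex_match(flattenChildren(names), std::regex(modelRegex));
}
```

For the example document, the children of A are B, C, C, C, B, C, which flatten to BCCCBC and match (B?C+)*. A child list such as B, B, C would be rejected.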
The above is how my XML parser validates the structure of DTD elements. Attribute validation will be covered in the next blog post.
If you are interested in seeing the source code, please leave your email address.