XML's heritage is in the world of documents, and this shows in its grammatical rules: its syntax is far more permissive than the data format of a database record. An XML parser converts the encoded form of a document (the encoding is named in the XML declaration) into an abstract model representing the document's information. The W3C has formalized this abstract model as the XML Information Set, or infoset (see Resources), but much XML processing must deal with the encoded source form, which allows a great deal of lexical variation: attribute order is not significant, the rules for whitespace between element names and attributes are loose, and there are many different ways to represent characters and escape special characters. Namespaces add further lexical flexibility (for example, the use of prefixes is optional). The result is that many documents are exactly equivalent under the XML 1.0 rules yet look completely different if you compare the encoded source files byte for byte.
This lexical flexibility causes problems in areas such as regression testing and digital signatures. Suppose you are building a test suite and want to use the document in Listing 1 as the expected output of one test case. Listing 1. Sample XML document
<doc>
  <a a1="1" a2="2">&#x31;&#x32;&#x33;</a>
</doc>
If you want your XML tests to be rigorous, you should recognize that the document in Listing 2 must also be accepted as correct output. Listing 2. An equivalent form of the document in Listing 1
<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <a  a2="2"  a1="1" >123</a >
</doc>
The whitespace inside the tags is different, the attribute order has changed, and the character references have been replaced with the equivalent literal characters, yet the infoset is unquestionably identical. A byte-by-byte comparison cannot establish this equivalence. Digital signatures face the same problem. Suppose you want to ensure that documents sent through a messaging system are not corrupted or tampered with in transit; to do so you attach a cryptographic hash value or a full digital signature of the document. But if you send Listing 1 through the messaging system, ordinary processing along the way may well turn Listing 1 into Listing 2. The simple hash value or digital signature then no longer matches, even though the document has not meaningfully changed.
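To make this concrete, here is a small sketch using the canonicalizer built into Python 3.8+ (`xml.etree.ElementTree.canonicalize`; note it implements the later C14N 2.0 revision, not the original specification this article discusses, though the behavior is the same for this example). Hashes of the raw bytes of the two equivalent documents disagree, while hashes of their canonical forms match.

```python
import hashlib
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

# Two documents that are equivalent under XML 1.0 rules but differ byte for
# byte: attribute order, whitespace inside tags, and character references vary.
doc_a = '<doc><a a1="1" a2="2">&#x31;&#x32;&#x33;</a></doc>'
doc_b = '<?xml version="1.0" encoding="UTF-8"?><doc><a  a2="2" a1="1" >123</a></doc>'

# A naive hash over the raw bytes does not match...
assert hashlib.sha256(doc_a.encode()).digest() != hashlib.sha256(doc_b.encode()).digest()

# ...but hashes over the canonical forms do.
c_a = ET.canonicalize(doc_a)
c_b = ET.canonicalize(doc_b)
assert c_a == c_b
print(c_a)
# <doc><a a1="1" a2="2">123</a></doc>
```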
The W3C's solution to this problem was developed as part of its work on XML digital signatures. Canonical XML (see Resources) defines a canonical lexical form for XML: it removes all of the permitted variations so that strict byte-by-byte comparison becomes meaningful. The process of translating a document into canonical form is called canonicalization (commonly abbreviated "c14n"). This article introduces the XML canonical form.
Canonical form rules
The best summary of the c14n process is the list provided in the specification (I have made some changes):

1. The document is encoded in UTF-8.
2. Line breaks are normalized to #xA on input, before parsing.
3. Attribute values are normalized, as a validating parser would normalize them.
4. Default attributes are added to each element, again as a validating parser would add them.
5. CDATA sections are replaced with their literal character content.
6. Character references and references to parsed entities are replaced with literal characters (except for special characters).
7. Special characters in attribute values and character content are replaced with character references (just as in ordinary well-formed XML).
8. The XML declaration and the DTD are removed. (Note: I usually recommend using an XML declaration, but I accept the rationale for omitting it from the canonical form.)
9. Empty elements are converted to start-tag/end-tag pairs.
10. Whitespace outside the document element and within start and end tags is normalized.
11. All whitespace in character content is retained (apart from characters removed during line-break normalization).
12. Attribute value delimiters are set to quotation marks (double quotes).
13. Superfluous namespace declarations are removed from each element.
14. The namespace declarations and attributes of each element are put into lexicographic order.
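Several of these rules can be observed directly with the canonicalizer built into Python 3.8+ (a sketch; it implements C14N 2.0 rather than the PyXML toolchain used later in this article, but the rules shown here are common to both):

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

# This input exercises several rules at once: the XML declaration (removed),
# a character reference (resolved to a literal character), a single-quoted
# attribute (re-delimited with double quotes), an empty element (expanded to
# a start/end tag pair), and a CDATA section (replaced by its character
# content, with the special character re-escaped).
source = ('<?xml version="1.0" encoding="UTF-8"?>'
          "<doc><a n='&#x31;'/><b><![CDATA[x < y]]></b></doc>")

print(ET.canonicalize(source))
# <doc><a n="1"></a><b>x &lt; y</b></doc>
```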
Don't worry if some of these rules seem unclear; the ones that matter most in practice are explained with examples later in the article. The parts of the c14n process that involve DTD validation are not covered here. I have mentioned the XML infoset several times, but interestingly the W3C chose not to define c14n in terms of the infoset; instead it is defined in terms of the XPath data model (a simpler model than the infoset, and some say a clearer one). This detail probably will not affect your understanding of the canonical form, but keep it in mind if you need to combine c14n with infoset-based technologies.
Canonicalizing a tag requires applying specific whitespace rules, a specific ordering of namespace declarations, and a specific arrangement of attributes within the tag. Here is my own informal summary of the steps for building a canonical start tag:

1. A left angle bracket (<) followed by the element's QName (prefix, colon, and local name, if there is a prefix).
2. If there is a default namespace declaration, it comes next, followed by the other namespace declarations in alphabetical order of the prefixes they define. All redundant namespace declarations are omitted (those already declared on an ancestor element and not overridden).
3. A single space before each namespace declaration, with no spaces around the equals sign or the quotation marks enclosing the namespace URI.
4. All attributes in alphabetical order, each preceded by a single space, with no spaces around the equals sign or the double quotes enclosing the attribute value.
5. Finally, the right angle bracket (>).
End tags in canonical form are simple: a left angle bracket and slash (</), the element's QName, and a right angle bracket (>). Listing 3 shows an example of XML that is not in canonical form. Listing 3. Example of XML not in canonical form
<?xml version="1.0" encoding="UTF-8"?>
<doc xmlns:x="http://example.com/x"
     xmlns="http://example.com/default">
  <a   a2="2"   a1="1"  >123</a>
  <b y:a1='1' xmlns='http://example.com/default'
     a3='"3"' xmlns:y='http://example.com/y' y:a2='2'/>
</doc>
Listing 4 is the canonical form of the same document. Listing 4. The canonical form of Listing 3
<doc xmlns="http://example.com/default" xmlns:x="http://example.com/x">
  <a a1="1" a2="2">123</a>
  <b xmlns:y="http://example.com/y" a3="&quot;3&quot;" y:a1="1" y:a2="2"></b>
</doc>
Canonicalizing Listing 3 requires the following changes: delete the XML declaration (the document is already UTF-8 encoded, so no conversion is needed). Move the default namespace declaration on doc in front of the other namespace declaration (the one with the x prefix). Reduce the whitespace in a's start tag, leaving a single space before each attribute. Remove the redundant default namespace declaration from b's start tag. Ensure that the remaining namespace declaration (for the y prefix) appears before all of the attributes, and sort the attributes (a3, then y:a1, then y:a2; strictly speaking, the specification sorts attributes by namespace URI first and local name second, which gives the same order here). Change the delimiters of the xmlns:y namespace declaration and the y:a1, y:a2, and a3 attributes from single quotes (') to double quotes ("), escaping the double quotes nested inside a3's value as &quot;. Finally, convert the empty element b into a start-tag/end-tag pair.
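Most of the tag-level rules can be checked with the canonicalizer in Python 3.8+ (a sketch; it implements C14N 2.0, but the attribute sorting, quote normalization, escaping, and empty-element expansion shown here match the behavior described above):

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

# An element resembling b in Listing 3: unordered single-quoted attributes,
# one value containing literal double quotes, and an empty-element tag.
src = """<doc><b a3='"3"' a1='1'/></doc>"""

print(ET.canonicalize(src))
# <doc><b a1="1" a3="&quot;3&quot;"></b></doc>
```

Note that the attributes come out sorted, the delimiters become double quotes, the nested quotes are escaped as &quot;, and the empty element is expanded to a start-tag/end-tag pair.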
I checked the canonical forms in this article using the Python c14n module that comes with PyXML (see Resources). Listing 5 shows the code used to canonicalize Listing 3 into Listing 4. Listing 5. Python code that canonicalizes XML
# PyXML under Python 2
from xml.dom import minidom
from xml.dom.ext import c14n

doc = minidom.parse('listing3.xml')
canonical_xml = c14n.Canonicalize(doc)
print canonical_xml
Canonicalizing character data
The canonical form of character data essentially uses plain literal text: character references are resolved to the actual Unicode characters (which are then serialized as UTF-8), CDATA sections are replaced by their raw content, and so on. The same treatment applies both to attribute values and to character data in element content. Attribute values are additionally normalized according to their DTD-declared types, but that mainly concerns documents that use DTDs, which this article does not cover. The sample document in Listing 6 borrows in part from an example in the c14n specification. Listing 6. XML example for canonicalization of character data
<?xml version="1.0" encoding="iso-8859-1"?>
<doc>
  <text>first line&#x0d;&#10;Second line</text>
  <value>&#x32;</value>
  <compute><![CDATA[value>"0" && value<"10" ?"valid":"error"]]></compute>
  <compute expr='value>"0" &amp;&amp; value&lt;"10" ?"valid":"error"'>valid</compute>
</doc>
Listing 7 shows the same document in canonical form. Listing 7. The canonical form of Listing 6
<doc>
  <text>first line&#xD;
Second line</text>
  <value>2</value>
  <compute>value&gt;"0" &amp;&amp; value&lt;"10" ?"valid":"error"</compute>
  <compute expr="value>&quot;0&quot; &amp;&amp; value&lt;&quot;10&quot; ?&quot;valid&quot;:&quot;error&quot;">valid</compute>
</doc>
Canonicalizing Listing 6 requires the following changes: delete the XML declaration and convert the document to UTF-8. Replace the character reference &#x32; with the literal digit 2. Replace the CDATA section with its content, escaping the right angle bracket (>) as &gt;, the ampersand (&) as &amp;, and the left angle bracket (<) as &lt;. Replace the single quotes around the expr attribute with double quotes, escaping the double quotes nested inside the value as &quot;.
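The trickiest part of Listing 6 is the line-break handling: the carriage return entered as &#x0D; survives parsing (character references are exempt from XML's line-ending normalization) and is re-serialized by the canonicalizer as the character reference &#xD;, while &#10; becomes a literal line feed. A small sketch with the Python 3.8+ canonicalizer:

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

c = ET.canonicalize('<doc><text>first line&#x0d;&#10;Second line</text></doc>')
print(repr(c))
# '<doc><text>first line&#xD;\nSecond line</text></doc>'
```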
One important step not illustrated by Listings 6 and 7 is the conversion to UTF-8, which is awkward to show in a printed listing. Suppose the source document contains the character reference &#169; (the copyright symbol); the canonical form uses the raw UTF-8 sequence instead (the hexadecimal bytes C2 A9).
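A sketch of that step, again with the Python 3.8+ canonicalizer: the canonical text carries the literal copyright character, and encoding it yields exactly the bytes C2 A9.

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

c = ET.canonicalize('<doc>&#169;</doc>')  # the reference resolves to the literal character
data = c.encode('utf-8')                  # canonical XML is defined in UTF-8
print(data)
# b'<doc>\xc2\xa9</doc>'   (C2 A9 is the UTF-8 encoding of the copyright sign)
```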
Sometimes the real need is to sign or compare a subtree of an XML document rather than the whole thing; for example, you may want to sign only the body of a SOAP message and ignore the envelope. The W3C provides a mechanism for this in a separate specification, Exclusive XML Canonicalization, which essentially governs which namespace declarations from outside the target subtree are carried into its canonical form.
Earlier I mentioned the lexical variations that prefix choice allows. In XML namespaces, the choice of prefix is not significant, so two documents that differ only in their namespace prefixes should arguably be treated as the same document. Unfortunately, c14n does not account for this: canonical XML preserves prefixes. Some entirely legitimate XML processing can rewrite prefixes, so be aware of this potential problem.
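The limitation is easy to demonstrate, and the later C14N 2.0 revision (implemented by Python 3.8+'s `xml.etree.ElementTree.canonicalize`) added an optional prefix-rewriting mode to address it. A sketch:

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

# Two documents identical except for the namespace prefix.
a = '<p:doc xmlns:p="http://example.com/"/>'
b = '<q:doc xmlns:q="http://example.com/"/>'

# Ordinary canonicalization preserves prefixes, so the results differ...
assert ET.canonicalize(a) != ET.canonicalize(b)

# ...but with C14N 2.0 prefix rewriting, both map onto generated prefixes
# and the canonical forms become byte-for-byte identical.
assert ET.canonicalize(a, rewrite_prefixes=True) == \
       ET.canonicalize(b, rewrite_prefixes=True)
print("prefix-rewritten forms match")
```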
Canonical XML is an important tool worth mastering. You may not run into XML-related security or software-testing problems immediately, but once you work in those areas you will find yourself reaching for c14n surprisingly often. It is one of those tools that clears away problems you did not even notice at first.