XML data encoding method

Last Update:2018-12-05 Source: Internet

Author: User



This article explains how character encoding works, especially in XML and the MSXML DOM.

How do you get XML files to carry data correctly across different platforms? Developers create an XML document, type in the data, wrap it in a few tags, tidy up the tag formatting, and perhaps even add an XML declaration as an extra. Then they try to load it and get unexpected error messages: the Microsoft XML Parser (MSXML) reports problems with the data. This is frustrating for the author of the XML. Is the parser really not working properly?

Of course not. When MSXML returns an unexpected error message, it is probably because the platform that receives the data stores text differently from the platform that sent it, which leads to character encoding problems.

Cross-platform Data Format
Ever since hardware and software practitioners first managed to connect two computers, they have been striving to create cross-platform technologies that let different platforms share data. Over time, the situation has become increasingly complex as the number of computer types, the ways of connecting them, and the kinds of data to be shared have all grown rapidly.

After decades of research into cross-platform programming technology, the only real cross-platform solution today (and possibly for a long time to come) is a simple, standard data format. The success of the Web is built on such formats. The main content exchanged between a Web server and a Web browser consists of HTTP headers and HTML pages, both of which are standard text formats.

In the following sections, I will discuss character encoding and standard character sets, Unicode, the HTTP Content-Type header, the HTML Content-Type meta tag, and character entities. If you are already familiar with these concepts, you can skip ahead to the tips for XML Document Object Model (DOM) programmers on encoding XML data, which begin in the section XML and character encoding.

About character encoding
A standard text format is built on a standard character set. Remember that all computers store text as numbers. However, different systems can use different numbers to store the same text. The following table shows how a handful of byte values are interpreted on two systems: first, a computer running Microsoft Windows with the default code page 1252; second, a typical Apple Macintosh computer using the Macintosh Roman code page.

Byte    Windows    Macintosh
140     Œ          å
229     å          Â
231     ç          Á
232     è          Ë
233     é          È

For example, if your grandmother ordered a new book from http://www.barnesandnoble.com/, she would not stop to think that her Macintosh computer stores characters differently from the Windows 2000 Web server that runs www.barnesandnoble.com. When she types her Swedish home address into the shipping field of the online order form, she trusts the Internet to deliver the character å (byte value 140 on her Macintosh). She does not expect the computer that receives and processes her message to interpret byte value 140 as the letter Œ.
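
The mismatch in the table can be reproduced with a short sketch. It is written in Python rather than the JScript used later in this article, and it assumes that Python's `cp1252` and `mac_roman` codecs correspond to Windows code page 1252 and the Macintosh Roman code page; the helper name `decode_byte` is made up for illustration.

```python
# Byte values taken from the table above.
TABLE_BYTES = [140, 229, 231, 232, 233]


def decode_byte(value: int, codec: str) -> str:
    """Interpret a single byte value under the given code page."""
    return bytes([value]).decode(codec)


# Print the code point each platform would assign to every byte value.
for value in TABLE_BYTES:
    win = decode_byte(value, "cp1252")
    mac = decode_byte(value, "mac_roman")
    print(f"{value}: Windows U+{ord(win):04X}, Macintosh U+{ord(mac):04X}")

# The letter a-ring (U+00E5) leaves the Macintosh as byte 140 ...
sent = "\u00e5".encode("mac_roman")
# ... and is misread by a server that assumes code page 1252.
received = sent.decode("cp1252")
```

Here `received` ends up as U+0152 (Œ) even though U+00E5 (å) was typed, because the two sides never agreed on a code page.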

Unicode
The Unicode Consortium decided that it would be a good idea to define one universal code page (using two bytes per character instead of one) that covers all the languages of the world, so that the problem of mapping between different code pages would disappear.

If Unicode solves the cross-platform character encoding problem, why is it not the only standard? The first problem is that switching to Unicode can mean doubling the size of every file, which is hard to accept in a networked world. So some people still prefer the older single-byte character sets, such as ISO-8859-1 through ISO-8859-15, or legacy multi-byte encodings such as Shift-JIS and EUC-KR.

The second problem is that many systems are still not Unicode-based, which means that some of the bytes that make up Unicode characters can cause serious problems for older systems on the network. The Unicode Transformation Formats (UTF) were defined for this reason. They use bit-shifting techniques to encode Unicode characters so that they are "transparent" to older systems (that is, they can pass through them safely).

The most popular of these encodings is UTF-8. UTF-8 takes the first 128 characters of the Unicode Standard (essentially the basic Latin letters A-Z and a-z, the digits 0-9, and some punctuation) and maps them directly to single-byte values. It then uses a bit-shifting technique, with the high bit of each byte set, to encode the rest of the Unicode characters. The result is that the little Swedish character å (0xE5) turns into the following two-byte gibberish: Ã¥ (0xC3 0xA5). So unless you can do bit shifting in your head, UTF-8 encoded data is not human-readable.
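
The bit shifting for a two-byte UTF-8 sequence can be written out by hand. The following Python sketch only covers the range U+0080 to U+07FF, which is enough for å. The function name `utf8_encode_two_byte` is hypothetical, and the result is compared against Python's built-in encoder as a sanity check.

```python
def utf8_encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    lead = 0b11000000 | (code_point >> 6)           # 110xxxxx: upper 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: lower 6 bits
    return bytes([lead, trail])


encoded = utf8_encode_two_byte(0xE5)                 # U+00E5 is a-ring
matches_builtin = encoded == "\u00e5".encode("utf-8")

# What a viewer that assumes ISO-8859-1 would display for those two bytes.
misread = encoded.decode("iso-8859-1")

# Characters below 128 are left as single bytes.
ascii_unchanged = "A".encode("utf-8")
```

Both bytes of the result have their high bit set (0xC3 and 0xA5), which is how a decoder tells them apart from the single-byte range.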

Content-Type header
Because the old single-byte character sets are still in use, the data transfer problem can only be solved if the actual character set of the data is stated. Recognizing this, the groups behind the Internet e-mail and HTTP protocols defined a standard way to specify the character set in the Content-Type property of the message header. This property names a character set from the list of registered character set names maintained by the Internet Assigned Numbers Authority (IANA). A typical HTTP response header might contain the following text:

HTTP/1.1 200 OK
Content-Length: 15327
Content-Type: text/html; charset=ISO-8859-1
Server: Microsoft-IIS/5.0
Content-Location: http://www.microsoft.com/Default.shtm
Date: Wed, 08 Dec 1999 00:55:26 GMT
Last-Modified: Mon, 06 Dec 1999 22:56:30 GMT

This header tells the application that the content following the headers is in the ISO-8859-1 character set.
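
As a rough illustration of how a receiving application might use this header, the Python sketch below extracts the charset parameter with the standard library's `email.message.Message` class and uses it to decode the body. The function name `charset_from_content_type` and the sample body are assumptions made for the example.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset()


charset = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/xml")

# Hypothetical message body: byte 0xE5 is a-ring in ISO-8859-1.
body = b"Sm\xe5land"
decoded = body.decode(charset)
```

When the charset parameter is missing, as in the second call, the receiver has to fall back on some other source of information.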

Content-Type meta tag
The Content-Type header is optional. In some applications the HTTP header information is stripped away and only the HTML itself is passed on. To remedy this, the HTML standards group defined an optional META tag that specifies the character set of the HTML document, which makes the document's character set self-describing:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

In this case, the ISO-8859-1 character set states that in this particular HTML page, byte value 229 means å. The page is now completely unambiguous on any system, and the data will not be misinterpreted. Unfortunately, the META tag is optional too, which leaves room for error.

Character entities
Not all systems support all registered character sets. For example, I do not think many platforms actually support the IBM mainframe character set called EBCDIC. Windows NT does, but many other systems probably do not, which is probably why the http://www.ibm.com home page is served in ASCII.

As a fallback, HTML allows individual characters on a page to be encoded by specifying their exact Unicode character values. These character entities are parsed independently of the character set, so their Unicode values can always be determined unambiguously. The syntax is &#229; (decimal) or &#xE5; (hexadecimal).
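
A quick way to see that both forms name the same character is to resolve them with Python's standard `html.unescape` function. The sketch also shows the reverse direction, where the `xmlcharrefreplace` error handler writes non-ASCII characters as numeric references; the sample string is made up.

```python
import html

# Decimal and hexadecimal references resolve to the same Unicode character.
decimal_ref = html.unescape("&#229;")
hex_ref = html.unescape("&#xE5;")
named_ref = html.unescape("&aring;")   # named HTML entity for the same letter

# Going the other way: emit pure ASCII, replacing anything else with
# a numeric character reference.
ascii_safe = "Sm\u00e5land".encode("ascii", errors="xmlcharrefreplace")
```

The resulting `ascii_safe` bytes contain only ASCII, so they survive any ASCII-compatible character set on the way to the reader.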

XML and character encoding
XML borrows these ideas from HTML and goes a step further by defining a completely unambiguous algorithm for determining the character set used for encoding. In XML, the optional encoding attribute in the XML declaration defines the character encoding. The following algorithm determines the default encoding:

If the file starts with a Unicode byte order mark [0xFF 0xFE] or [0xFE 0xFF], the document is considered to be in UTF-16 encoding. Otherwise, it is in UTF-8.

The following are all correct and equivalent XML documents (the XML document column shows the bytes as an ISO-8859-1 viewer would display them):

Character set or encoding              HTTP header                                 XML document
ISO-8859-1                             Content-Type: text/xml; charset=ISO-8859-1  <test>å</test>
UTF-8                                  Content-Type: text/xml                      <test>Ã¥</test>
ISO-8859-1                             Content-Type: text/xml                      <?xml version="1.0" encoding="ISO-8859-1"?><test>å</test>
UTF-8 (with character entity)          Content-Type: text/xml                      <test>&#229;</test>
UTF-16 (Unicode with byte order mark)  Content-Type: text/xml                      FF FE 3C 00 74 00 65 00 73 00 74 00 3E 00 E5 00
                                                                                   3C 00 2F 00 74 00 65 00 73 00 74 00 3E 00 0D 00
                                                                                   0A 00
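
The equivalence of the five rows can be checked with any conforming XML parser. The Python sketch below uses the standard library's `xml.etree.ElementTree` (not MSXML) and assumes the `<test>` element from the table. For the first row the charset is known only from the HTTP header, so the bytes are decoded before parsing.

```python
import codecs
import xml.etree.ElementTree as ET

# Row 1: the charset comes from the HTTP header, so decode first, then parse.
row1 = ET.fromstring(b"<test>\xe5</test>".decode("iso-8859-1"))

# Row 2: no header charset, no declaration, no byte order mark: UTF-8.
row2 = ET.fromstring(b"<test>\xc3\xa5</test>")

# Row 3: the XML declaration names the encoding.
row3 = ET.fromstring(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><test>\xe5</test>')

# Row 4: a numeric character reference needs no special encoding at all.
row4 = ET.fromstring(b"<test>&#229;</test>")

# Row 5: UTF-16 with a little-endian byte order mark (FF FE).
utf16_bytes = codecs.BOM_UTF16_LE + "<test>\u00e5</test>\r\n".encode("utf-16-le")
row5 = ET.fromstring(utf16_bytes)

texts = [row.text for row in (row1, row2, row3, row4, row5)]
all_equal = all(text == "\u00e5" for text in texts)
```

Every row yields the same single character, U+00E5, as the text of the `test` element.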

Character sets and the MSXML DOM
Now that we have covered the different ways of encoding characters, let's look at how XML documents are loaded into the MSXML DOM, and at the kinds of error messages you may receive when it encounters ambiguously encoded characters. The loadXML and load methods are the two main ways of loading an XML DOM document.

The loadXML method always takes a Unicode BSTR, which can only be encoded in UCS-2 or UTF-16. If you pass loadXML anything other than a valid Unicode BSTR, the load will fail.

The load method can interpret its VARIANT argument as any of the following:

Value                Description
BSTR                 If the VARIANT is a BSTR, it is interpreted as a URL.
VT_ARRAY | VT_UI1    The VARIANT can also be a SAFEARRAY containing the raw encoded bytes.
IUnknown             If the VARIANT is an IUnknown interface, the DOM document calls QueryInterface for IStream, IPersistStream, and IPersistStreamInit.

The load method implements the following algorithm to determine the character encoding or character set of the XML document.

If the Content-Type HTTP header defines a character set, that character set overrides anything stated in the XML document itself. Because the SAFEARRAY and IStream mechanisms have no HTTP headers, this obviously does not apply to them.

If there is a two-byte Unicode byte order mark, it assumes that the encoding is UTF-16. It can handle both big-endian and little-endian.

If there is a four-byte Unicode byte order mark (0xFF 0xFE 0xFF 0xFE), it assumes that the encoding is UTF-32. It can handle both big-endian and little-endian. Note that the Unicode Standard defines the UTF-32 byte order mark as 0x00 0x00 0xFE 0xFF (big-endian) or 0xFF 0xFE 0x00 0x00 (little-endian).

Otherwise, it assumes that the encoding is UTF-8, unless it finds an XML declaration whose encoding attribute specifies some other character set, such as ISO-8859-1, Windows-1252, Shift-JIS, and so on.
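
The steps above can be sketched as a small function. This is an illustrative Python approximation, not MSXML's actual implementation: the name `guess_xml_encoding` is hypothetical, the four-byte marks are the ones defined by the Unicode Standard (taken from Python's `codecs` module), and the XML declaration is matched with a simplified regular expression.

```python
import codecs
import re
from typing import Optional

# Simplified match for encoding="..." in an XML declaration at the very start.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def guess_xml_encoding(data: bytes, http_charset: Optional[str] = None) -> str:
    """Approximate the precedence rules described in the text."""
    # 1. A charset from the HTTP Content-Type header overrides everything.
    if http_charset:
        return http_charset.lower()
    # 2./3. Byte order marks. The four-byte marks are tested first because
    #       the little-endian UTF-32 mark starts with the bytes FF FE.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 4. Otherwise UTF-8, unless the XML declaration names another encoding.
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"


plain = guess_xml_encoding(b"<test/>")
declared = guess_xml_encoding(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><test/>')
with_bom = guess_xml_encoding(
    codecs.BOM_UTF16_LE + "<test/>".encode("utf-16-le"))
overridden = guess_xml_encoding(b"<test/>", http_charset="Windows-1252")
```

Ordering the checks this way mirrors the precedence in the text: transport-level information first, then byte order marks, then the declaration, then the UTF-8 default.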

You will see two error messages returned from the XML DOM that indicate encoding problems. The first usually means that a character in the document does not match the encoding of the XML document:

An invalid character was found in text content.

The parseError object tells you exactly which line, and which position on that line, contains the offending character, so that you can fix the problem.

The second error message means that the document started out with a Unicode byte order mark (or that you called the loadXML method), and that the encoding attribute then specified an encoding that is not two-byte, such as UTF-8 or Windows-1250:

Switch from current encoding to specified encoding not supported.

Alternatively, you may have called the load method on a document that starts out in a single-byte encoding (no byte order mark), but whose encoding attribute then specifies a two-byte or four-byte encoding such as UTF-16 or UCS-4.

The bottom line is that you cannot use the encoding attribute of the XML declaration to switch between a byte-oriented character set such as UTF-8, Shift-JIS, or Windows-1250 and a two-byte or four-byte Unicode encoding such as UTF-16, UCS-2, or UCS-4. The reason is that the declaration itself must use the same number of bytes per character as the rest of the document.
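
The rule can be expressed as a simple consistency check. The following Python sketch is only a model of the principle, not of MSXML's internals: the function name `can_switch` and the set of "wide" encoding names are assumptions made for illustration.

```python
import codecs

# Encoding names whose code units are two or four bytes wide.
WIDE_ENCODINGS = {"utf-16", "ucs-2", "utf-32", "ucs-4"}

# Byte order marks that signal a two-byte or four-byte encoding.
WIDE_BOMS = (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
             codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def can_switch(data: bytes, declared_encoding: str) -> bool:
    """Return True if the declared encoding belongs to the same family
    (byte-oriented or two/four-byte) as the bytes the document starts with."""
    starts_wide = data.startswith(WIDE_BOMS)
    declared_wide = declared_encoding.lower() in WIDE_ENCODINGS
    return starts_wide == declared_wide


byte_doc = b'<?xml version="1.0" encoding="UTF-16"?><test/>'
wide_doc = codecs.BOM_UTF16_LE + (
    '<?xml version="1.0" encoding="UTF-8"?><test/>'.encode("utf-16-le"))

byte_to_wide = can_switch(byte_doc, "UTF-16")          # mismatch
wide_to_byte = can_switch(wide_doc, "UTF-8")           # mismatch
byte_to_byte = can_switch(b"<test/>", "ISO-8859-1")    # allowed
wide_to_wide = can_switch(wide_doc, "UTF-16")          # allowed
```

The two mismatching cases correspond to the two situations described above in which the second error message appears.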

Finally, the IXMLHttpRequest interface provides the following properties for accessing the downloaded data:

Property          Description
responseXML       The response entity as parsed by the MSXML DOM parser (using the same rules as the load method).
responseText      The response entity as a string. This property blindly decodes the received message body as UTF-8. This is a known issue that should be resolved in an upcoming MSXML Web release.
responseBody      The response entity as an array of unsigned bytes.
responseStream    The response entity as an IStream interface.

Creating new XML documents with MSXML
Once an XML document is loaded, you can work with it through the DOM without worrying about encoding, because the document is stored in memory as Unicode. All XML DOM interfaces are based on the COM BSTR, which is a two-byte Unicode string. This means that you can build an MSXML DOM document in memory from scratch, containing any Unicode characters you like, and every component that shares the in-memory DOM will agree on the meaning of each Unicode character value. When you save the document, however, MSXML encodes all data as UTF-8 by default. For example, suppose you do the following:

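// Build a one-element document in memory and save it to disk.
var xmldoc = new ActiveXObject("Microsoft.XMLDOM");
var e = xmldoc.createElement("test");
e.text = "å";              // Unicode character value 229
xmldoc.appendChild(e);
xmldoc.save("foo.xml");    // written as UTF-8 by default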

The result is the following UTF-8 encoded file (shown as an ISO-8859-1 viewer would display it):

<test>Ã¥</test>

Note that the preceding example only works when run outside a browser. Because of security restrictions, calling the save method in a browser does not produce the same result.

Although this looks a little strange, it is correct. The following test loads the UTF-8 encoded file and checks whether the UTF-8 is decoded back to the Unicode character value 229. It is:

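var xmldoc = new ActiveXObject("Microsoft.XMLDOM");
xmldoc.async = false;      // load synchronously so the document is ready below
xmldoc.load("foo.xml");
// The two UTF-8 bytes should decode back to the single character 229.
if (xmldoc.documentElement.text.charCodeAt(0) == 229)
{
    WScript.Echo("yippee - it worked!!");
}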
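
For readers without MSXML at hand, the same round trip can be approximated with Python's standard `xml.etree.ElementTree` module. This is an analogous sketch, not the MSXML API: the file name mirrors the example above, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build <test>a-ring</test> in memory; the text is a Unicode string.
root = ET.Element("test")
root.text = "\u00e5"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.xml")
    # Save as UTF-8, then look at the raw bytes that were written.
    ET.ElementTree(root).write(path, encoding="utf-8")
    with open(path, "rb") as f:
        raw = f.read()
    # Load the file again and let the parser decode it.
    reloaded = ET.parse(path).getroot()

code_point = ord(reloaded.text[0])
contains_two_byte_form = b"\xc3\xa5" in raw
```

On disk the character occupies the two bytes 0xC3 0xA5, and after reloading it is the single character value 229 again.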

To change the encoding used by the XML DOM save method, you must create an XML declaration with an encoding attribute at the top of the document, as follows:

var pi = xmldoc.createProcessingInstruction("xml",
    "version='1.0' encoding='ISO-8859-1'");
// Insert before the first child so the declaration ends up at the top
// even if the document already has a root element.
xmldoc.insertBefore(pi, xmldoc.firstChild);

When you call the save method, you get the following ISO-8859-1 encoded file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<test>å</test>

Be careful not to be confused by the xml property. The xml property returns a Unicode string. If you read the xml property of the DOMDocument object after creating the ISO-8859-1 encoding declaration, you get back the following Unicode string:

<?xml version="1.0"?>
<test>å</test>

Notice that the ISO-8859-1 encoding declaration is gone. This is normal. It is removed so that you can pass this string to loadXML and have it work. If the declaration were left in, loadXML would fail with the error message "Switch from current encoding to specified encoding not supported."

Conclusion
I hope this article has helped explain how character encoding works, especially in XML and the MSXML DOM. Character set encoding is quite simple once you understand it, and XML is excellent in this respect because it leaves no room for ambiguity. Although the MSXML DOM has a few quirks that you need to watch out for, it is still a powerful tool that lets you read and write any XML encoding.

For more information
Microsoft MSDN Online Library: XML DOM Reference

Character Encoding Model, by Ken Whistler and Mark Davis

IANA Character Sets

The Internet Engineering Task Force (IETF), at http://www.ietf.org, provides a list of RFCs

Microsoft MSDN Online Library: Compatibility Issues with Mixed Environments

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
