A Detailed Introduction to Encoding XML Documents with UTF-8

Google's Sitemap service requires that every sitemap submitted to it be encoded in Unicode UTF-8. Google does not even allow other Unicode encodings such as UTF-16, to say nothing of non-Unicode encodings such as ISO-8859-1. Technically, this means Google is using a nonconforming XML parser, since the XML Recommendation specifically requires that "all XML processors must accept the UTF-8 and UTF-16 encodings of Unicode 3.1". But is that really such a big problem?
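
As a concrete illustration (not part of the original article), here is a minimal sketch in Java of parsing a UTF-8 sitemap with the JDK's standard parser; the URL is a placeholder:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class SitemapParse {
        public static void main(String[] args) throws Exception {
            // A minimal sitemap, declared and encoded as UTF-8.
            // The URL is a placeholder.
            String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                    + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n"
                    + "  <url><loc>http://example.com/</loc></url>\n"
                    + "</urlset>";
            // Any conforming XML parser accepts UTF-8 (and UTF-16);
            // Google's service additionally insists on UTF-8.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            System.out.println(doc.getDocumentElement().getTagName()); // prints "urlset"
        }
    }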

Everyone can use UTF-8

Universality is the first and most compelling reason to choose UTF-8. It can handle every script in use in the world today. A few gaps remain, but they are increasingly obscure and are gradually being filled in. The scripts that are still missing typically have not been implemented in any other character set either, and even when they have, they cannot be used in XML. At best they have been shoehorned into a single-byte character set such as Latin-1 by way of font hacks. Real support for such rare scripts will come to Unicode first, and quite possibly only to Unicode.

However, that is merely a reason to use Unicode, not a reason to prefer UTF-8 over UTF-16 or the other Unicode encodings. The most immediate reason is the breadth of tool support. Essentially every major editor you might use for XML can handle UTF-8, including jEdit, BBEdit, Eclipse, emacs, and even Notepad. No other Unicode encoding enjoys such broad support, among XML tools or anywhere else.

For some of these editors, such as BBEdit and Eclipse, UTF-8 is not the default character set. It is time the defaults changed: every tool should ship with UTF-8 as its default encoding. Until that happens, files that cross national, platform, and language boundaries will keep dragging us into a quagmire of non-interoperability. In the meantime, though, changing the default yourself is easy. In Eclipse, for example, the General/Editors preference panel shown in Figure 1 lets you specify that all files use UTF-8. You may notice that Eclipse wants the default to be MacRoman; if you accept that, your source files will not compile when they are passed to programmers using Microsoft Windows or to computers outside the United States and Western Europe.

Figure 1. Changing the default character set in Eclipse

Of course, for UTF-8 to work, all the files developers exchange must actually use UTF-8; but that is not a problem. Unlike MacRoman, UTF-8 is not limited to a handful of scripts or a single platform. Anyone can use UTF-8. The same cannot be said of MacRoman, Latin-1, SJIS, and the other legacy national character sets.

UTF-8 also works well in tools that were never written with multibyte data in mind. Other Unicode formats such as UTF-16 typically contain many zero bytes, which many tools interpret as end-of-file or as some other special delimiter, with unwanted, unexpected, and frequently unpleasant results. If UTF-16 data is loaded into a C string as-is, for example, the string may be cut off at the second byte of the first ASCII character, because that byte is zero. A UTF-8 file contains a null byte only where it genuinely means null. Of course, you should not be processing XML documents with such naive tools; but documents in legacy systems often end up in strange places, handled by software that neither knows nor cares that these byte sequences are anything special. For systems that are not Unicode- and XML-aware, UTF-8 is far less likely to cause problems than UTF-16 or any other Unicode encoding.
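
A small sketch of the zero-byte problem, dumping the bytes Java produces for an arbitrary ASCII string:

    import java.nio.charset.StandardCharsets;

    public class ZeroBytes {
        public static void main(String[] args) {
            String s = "tree"; // an arbitrary all-ASCII example
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);     // 74 72 65 65
            byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE); // 00 74 00 72 00 65 00 65
            System.out.println(utf8.length);  // 4 bytes, none of them zero
            System.out.println(utf16.length); // 8 bytes, every other one is 0x00
            // A byte-oriented C routine such as strlen() stops at the first
            // 0x00, truncating the UTF-16 string after zero or one
            // characters, depending on byte order.
        }
    }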

What the experts say

XML was the first major standard to fully embrace UTF-8, but it was only the beginning; other standards bodies are increasingly recommending UTF-8 as well. For example, URLs containing non-ASCII characters have long been a sore spot on the Web: a URL with non-ASCII characters that works on a PC fails on a Mac, and vice versa. The World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF) recently resolved this problem by agreeing that all such URLs must be encoded in UTF-8 and nothing else.
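
As a small illustration of that rule (the character here is an arbitrary example), Java's URLEncoder can percent-escape the UTF-8 bytes of a non-ASCII character, which yields the same URL on every platform:

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    public class Utf8Url {
        public static void main(String[] args) {
            // U+6728, an arbitrary CJK ideograph, percent-escaped as its
            // three UTF-8 bytes. Requires Java 10+ for the Charset overload.
            System.out.println(URLEncoder.encode("\u6728", StandardCharsets.UTF_8));
            // prints %E6%9C%A8
        }
    }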

The W3C and the IETF have grown increasingly firm about making UTF-8 the first, last, and sometimes only choice. The W3C's Character Model for the World Wide Web 1.0: Fundamentals states: "When a unique character encoding is required, the character encoding must be UTF-8, UTF-16, or UTF-32. US-ASCII is upward-compatible with UTF-8 (a US-ASCII string is also a UTF-8 string; see [RFC 3629]), so UTF-8 is appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so widely desired that it is almost a requirement. The W3C wisely explains: "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Reasons for choosing one of these may include efficiency of internal processing and interoperability with other processes."

I can accept the internal-processing-efficiency argument. For example, Java represents strings internally as UTF-16, which makes indexing into a string fast. But Java code never exposes that internal representation to the programs it exchanges data with. For external data exchange, a java.io.Writer is used and the character set is specified explicitly; when making that choice, UTF-8 is strongly preferred.
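
A minimal sketch of that pattern; the file name and content are placeholders:

    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class WriteUtf8 {
        public static void main(String[] args) throws Exception {
            // Strings are UTF-16 inside the JVM, but what reaches the disk
            // is whatever charset the Writer is given; name it explicitly.
            try (Writer out = new OutputStreamWriter(
                    new FileOutputStream("sitemap.xml"), StandardCharsets.UTF_8)) {
                out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<urlset/>\n");
            }
        }
    }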

The IETF is even more explicit. The IETF charset policy [RFC 2277] states, in language that leaves no room for doubt:

Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but lack of an ability to use UTF-8 is a violation of this policy; such a violation would need a variance procedure ([BCP9] section 9) with clear and solid justification in the protocol specification document before being entered into or advanced upon the standards track.

For existing protocols, or protocols that move data from existing datastores, support of other charsets, or even using a default other than UTF-8, may be a requirement. This is acceptable, but UTF-8 support MUST be possible.

The point: support for legacy protocols and files may require character sets and encodings other than UTF-8 for some time to come, but even then I would tread very carefully. Every new protocol, application, and document should use UTF-8.

Chinese, Japanese, and Korean

A common misconception is that UTF-8 is a compression format. It is not. ASCII characters occupy only half as much space in UTF-8 as they do in UTF-16 and the other wider Unicode encodings. But some characters take about 50% more space in UTF-8, most notably the Chinese, Japanese, and Korean (CJK) ideographs.

But even a CJK XML document may actually be smaller in UTF-8 than in UTF-16. A Chinese XML document, for example, contains a great deal of ASCII: characters such as <, >, &, =, ", ', and spaces, all of which are smaller in UTF-8 than in UTF-16. The exact compression or expansion factor varies from document to document, but either way the difference is unlikely to be dramatic.
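
One rough way to check this for a given document is to compare the encoded sizes directly; a sketch with a tiny, made-up fragment:

    import java.nio.charset.StandardCharsets;

    public class MarkupSize {
        public static void main(String[] args) {
            // A made-up fragment: the markup is ASCII, only the content is CJK.
            String xml = "<item name=\"tree\">\u6728</item>";
            System.out.println(xml.getBytes(StandardCharsets.UTF_8).length);    // 28
            System.out.println(xml.getBytes(StandardCharsets.UTF_16BE).length); // 52
        }
    }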

Finally, it is worth noting that ideographic scripts such as Chinese and Japanese tend to use fewer characters than alphabetic scripts such as Latin and Cyrillic. Precisely because each character carries so much information, it is worth the three or more bytes needed to express it: the same words and sentences can be written in fewer characters than in English or Russian. For example, the Japanese word for "tree" is the single ideograph 木 (which looks like a tree). It takes three bytes in UTF-8, whereas the English word "tree" has four letters and needs four bytes. The Japanese for "grove" is 林 (two trees together): three bytes in UTF-8, against five letters and five bytes for the English word "grove". The Japanese 森 ("forest", three trees) still needs only three bytes, while the corresponding English word requires six.
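
These byte counts are easy to verify; a short sketch:

    import java.nio.charset.StandardCharsets;

    public class CjkVsEnglish {
        public static void main(String[] args) {
            String[][] pairs = { {"\u6728", "tree"},    // 木
                                 {"\u6797", "grove"},   // 林
                                 {"\u68EE", "forest"} }; // 森
            for (String[] p : pairs) {
                System.out.printf("%d vs %d bytes%n",
                        p[0].getBytes(StandardCharsets.UTF_8).length,  // 3 each
                        p[1].getBytes(StandardCharsets.UTF_8).length); // 4, 5, 6
            }
        }
    }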

If you need compression, use zip or gzip. Once compressed, UTF-8 and UTF-16 come out at roughly the same size, no matter how different they were to begin with: whichever encoding starts out larger simply has more redundancy for the compression algorithm to squeeze out.
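
A sketch of that comparison, gzipping the same made-up, highly repetitive text in both encodings:

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class GzipCompare {
        public static void main(String[] args) throws Exception {
            // Placeholder sample text; repeat() requires Java 11+.
            String text = "<url><loc>http://example.com/</loc></url>\n".repeat(1000);
            System.out.println(gzippedSize(text.getBytes(StandardCharsets.UTF_8)));
            System.out.println(gzippedSize(text.getBytes(StandardCharsets.UTF_16BE)));
            // The raw sizes differ by a factor of two; the gzipped sizes
            // come out far closer, since the extra bytes are pure redundancy.
        }

        static int gzippedSize(byte[] raw) throws Exception {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(raw);
            }
            return buf.size();
        }
    }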

Robustness

UTF-8 is also a more robust and more easily interpreted format than any other text encoding designed before or since. First, unlike UTF-16, UTF-8 has no endianness problem: big-endian and little-endian UTF-8 are identical, because UTF-8 is defined in terms of 8-bit bytes rather than 16-bit code units. UTF-8 has none of the byte-order ambiguity that must otherwise be resolved with a byte order mark or other heuristics.
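
The following sketch makes the point concrete by dumping the bytes Java produces for each encoding:

    import java.nio.charset.StandardCharsets;

    public class Endianness {
        public static void main(String[] args) {
            String s = "A";
            dump(s.getBytes(StandardCharsets.UTF_16BE)); // 00 41  (big-endian)
            dump(s.getBytes(StandardCharsets.UTF_16LE)); // 41 00  (little-endian)
            dump(s.getBytes(StandardCharsets.UTF_16));   // FE FF 00 41 (byte order mark first)
            dump(s.getBytes(StandardCharsets.UTF_8));    // 41, identical everywhere
        }

        static void dump(byte[] bytes) {
            for (byte b : bytes) System.out.printf("%02X ", b);
            System.out.println();
        }
    }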

An even more important feature of UTF-8 is that it is stateless. Every byte in a UTF-8 stream or sequence is unambiguous: given a single byte, you can tell immediately whether it is a single-byte character, the first byte of a two-, three-, or four-byte sequence, or a continuation byte inside such a sequence (there is slightly more to it than that, but you get the idea). In UTF-16, you cannot tell whether the byte 0x41 is the letter "A"; sometimes it is, and sometimes it is not. You must track enough state to know where you are in the stream, and if a single byte is lost, all subsequent data becomes unusable. In UTF-8, lost or mangled bytes are easy to detect and do not corrupt the rest of the data.
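
The byte classes can be read straight off the high bits of each byte; here is a small sketch of that test:

    public class Utf8ByteClass {
        // Classify one byte of a UTF-8 stream from its high bits alone;
        // no surrounding context is needed. The bit patterns are taken
        // from the UTF-8 definition itself.
        static String classify(int b) {
            if ((b & 0x80) == 0x00) return "single-byte character (ASCII)";
            if ((b & 0xC0) == 0x80) return "continuation byte";
            if ((b & 0xE0) == 0xC0) return "first byte of a 2-byte sequence";
            if ((b & 0xF0) == 0xE0) return "first byte of a 3-byte sequence";
            if ((b & 0xF8) == 0xF0) return "first byte of a 4-byte sequence";
            return "invalid in UTF-8";
        }

        public static void main(String[] args) {
            System.out.println(classify(0x41)); // always the letter 'A'
            System.out.println(classify(0xE6)); // starts a 3-byte sequence
            System.out.println(classify(0x9C)); // continuation byte
        }
    }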

UTF-8 is not a panacea. Applications that need random access to arbitrary positions within a document may run faster with a fixed-width encoding such as UCS-2 or UTF-32. (Once surrogate pairs are taken into account, UTF-16 is a variable-width encoding too.) But XML processing is not such an application: the XML specification effectively requires parsers to start at a document's first byte and parse through to the last, and all existing parsers do exactly that. Faster random access would not help XML processing at all. It may be a good reason to choose a different encoding for a database or some other system, but it does not apply to XML.
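
A two-line demonstration of the surrogate-pair caveat, using an arbitrary character outside the Basic Multilingual Plane:

    public class SurrogatePair {
        public static void main(String[] args) {
            // U+1D11E needs a surrogate pair: two UTF-16 code units.
            String s = "\uD834\uDD1E";
            System.out.println(s.length());                      // 2 code units
            System.out.println(s.codePointCount(0, s.length())); // 1 character
        }
    }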

Conclusion

In an increasingly internationalized world, where linguistic and political boundaries grow ever blurrier, character sets that depend on region no longer make sense. Unicode is the only character set that interoperates across all of them, and UTF-8 is the best Unicode encoding:

Broad tool support, including the best compatibility with legacy ASCII systems.

Simple and efficient to process.

Resistant to corruption.

Platform independent.

Stop debating character sets and encodings: choose UTF-8, and end the argument.
