Exploring the principles behind a garbled XML file


The screenshots from the original post could not be reproduced here. To see them, please visit the original address given below.

This article is from http://blog.csdn.net/dinglang_2009/article/details/6895355; please credit the source when reposting.

In day-to-day development we use XML all the time. It has long been a standard and is used very widely, but that is not the focus of this article.

I believe everyone ran into the "garbled text" problem early on, and it is a real headache for Chinese programmers. I have always wanted to dig into how "encoding" really works, but my ability is limited, and the dry theory (binary, ASCII, Unicode, UTF-8, GB2312, ISO ...) makes my eyes glaze over. I find it hard to read, let alone truly understand. Pointers from readers are welcome.

I will take a simple "garbled XML file" problem that I ran into at work, solve it, and then analyze the principle behind it.

First, create a new text file locally and change its extension to ".xml". Then open it in Notepad and add some content that conforms to the XML specification, with encoding="UTF-8" in the XML declaration.

After writing, press Ctrl+S to save, then open the XML file in IE to check that the document is well-formed. Unexpectedly, parsing fails, and IE reports an invalid character.

What is this? My XML document looks fine. An invalid character? This is surely a typical "encoding" problem. Clever me, my first thought was to change the encoding that IE uses.

But under "View" > "Encoding", every encoding is grayed out and cannot be selected. This is because the XML document declares its encoding as "UTF-8", which tells the browser (the XML parsing engine) that the document must be parsed as UTF-8, so no other encoding can be chosen for viewing it.

The error itself arises because we did not choose an encoding when saving in Notepad, so Notepad defaulted to the operating system's encoding. On a Chinese-language Windows system that is GB2312 (more precisely code page 936, GBK, which is a superset of GB2312). When IE then parses the document as UTF-8, as the declaration demands, the bytes do not form valid text, which causes the error above. (By default, Windows saves text files to disk in the operating system's encoding.) For example, suppose the word "中国" ("China") in our XML document is stored as some code, say "10001", under GB2312. Under UTF-8 that same "10001" does not correspond to "中国": it maps to nothing at all, or to garbage, so IE refuses to display it. So what should we do? There are two ways to solve this problem.
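The mismatch described above can be reproduced outside the browser. The following is a minimal Python sketch, not what IE does internally: the sample document and the helper name `try_parse` are made up for illustration, and the "gbk" codec stands in for a Notepad "ANSI" save on a Chinese-language system.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the declaration promises UTF-8.
xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<country>中国</country>'

# Assumption: "gbk" approximates what a Chinese-locale "ANSI" save writes.
ansi_bytes = xml_text.encode("gbk")
utf8_bytes = xml_text.encode("utf-8")


def try_parse(data):
    """Return the root element's text, or a short message if parsing fails."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError as exc:
        return f"parse error: {exc}"


# Bytes and declaration disagree: the parser rejects the document.
mismatch_result = try_parse(ansi_bytes)
# Bytes and declaration agree: the text comes back intact.
match_result = try_parse(utf8_bytes)

print(mismatch_result)
print(match_result)
```

The parser only sees bytes plus a declaration, so it has no way to know that the bytes were meant as GBK.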

The first method: declare the encoding as gb2312 in the XML declaration (encoding="gb2312"), so that the declaration matches the bytes Notepad actually wrote.

After saving, open the file in IE again, and this time it displays correctly.

Congratulations, the problem is solved. However, this method is not recommended. XML documents are generally encoded in UTF-8 so that they remain portable.
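The first method works because the declaration now names the encoding the bytes are really stored in. As a rough Python sketch of that idea, the hypothetical helper `declared_encoding` below reads the name out of the XML declaration and the bytes are then decoded with it. It assumes an ASCII-compatible encoding and no leading byte-order mark, and the sample document is made up.

```python
import re


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration, or `default` if absent.

    Simplification: only works when the declaration itself is readable as
    ASCII (true for UTF-8 and GB2312) and the data has no leading BOM.
    """
    m = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([\w.-]+)["\']', data)
    return m.group(1).decode("ascii") if m else default


# Hypothetical document whose declaration and bytes agree on gb2312.
doc = '<?xml version="1.0" encoding="gb2312"?>\n<country>中国</country>'.encode("gb2312")

name = declared_encoding(doc)
text = doc.decode(name)
print(name)
print(text)
```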

The second method:

Open the document in Notepad and click "Save As". At the bottom of the dialog there is an "Encoding" option: select "UTF-8", save, and try again.
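What "Save As" with UTF-8 does is re-encode the same characters into different bytes. A minimal Python sketch of that step follows. The function name `resave_as_utf8` is hypothetical, "gbk" is again an assumption about the file's current encoding, and the demo runs on a throwaway temporary file.

```python
import os
import tempfile


def resave_as_utf8(path, source_encoding="gbk"):
    """Rewrite the file at `path` as UTF-8, assuming it is in `source_encoding`."""
    with open(path, "rb") as f:
        text = f.read().decode(source_encoding)
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


# Demo: create a temporary file saved the "ANSI" way, then convert it.
xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<country>中国</country>'
fd, demo_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(xml_text.encode("gbk"))

resave_as_utf8(demo_path)

with open(demo_path, "rb") as f:
    resaved = f.read()
os.remove(demo_path)

print(resaved.decode("utf-8"))
```

Notepad's own UTF-8 option may additionally write a leading EF BB BF marker; this sketch does not.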

In fact, if we define XML documents in development tools such as Eclipse or Microsoft Visual Studio, we never run into this problem. These IDEs are "smart": when an IDE saves an XML document to disk, it automatically uses the encoding that the document declares. As a result, many programmers who only ever work in one IDE get by just fine without really understanding the principles underneath. (As far as I know, in the early years many top programmers in China wrote code in text editors such as EditPlus, and many of the experts on Linux/Unix write code in vi/Vim. Maybe that is the difference. Of course, this is not the focus of this article.)

Here is a bit of theory. Read on if it does not make you dizzy.

At first, the Internet had only one character set: ANSI's ASCII character set. It uses 7 bits per character and represents 128 characters in total, including letters, digits, punctuation, and other common symbols. It was later extended to 8 bits per character, giving 256 characters, mainly by adding special symbols such as box-drawing characters to the original 7-bit set.

Later, as more languages came online, ASCII could no longer meet the needs of information exchange. To represent their own scripts, various countries developed their own character sets on top of ASCII. These character sets derived from the ANSI standard are customarily called ANSI character sets; their formal name is MBCS (Multi-Byte Character System). They all build on the 128 ASCII codes and stay compatible with them: a byte value of 128 or greater acts as a lead byte, and the one (or even two) bytes that follow it combine with the lead byte to form the actual code. There are many such character sets, and GB-2312 is the one we meet most often.

For example, in the GB-2312 character set the word "连通" ("connected") is encoded as C1 AC CD A8, where C1 and CD are lead bytes. The first 128 codes are reserved for standard ASCII; for example, "0" is encoded as 30H (hexadecimal 30). When software reads 30H, it knows the value is below 128, so it is standard ASCII and means "0". When it reads C1, which is greater than 128, it knows another byte follows, so C1 AC together form one code, which in GB-2312 denotes "连".

Because every language had its own character set, there ended up being far too many of them, and converting between them made international exchange very inconvenient. The Unicode character set was therefore proposed. It originally used a fixed 16 bits (two bytes, one word) per character, which can represent 65,536 characters and covers the commonly used characters of almost all the world's languages, which makes exchanging information much easier. (Unicode has since grown beyond 65,536 characters; UTF-16 represents the additional ones with pairs of 16-bit units.) The standard encoding of Unicode is called UTF-16.
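The lead-byte rule described above for GB-2312 can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text (a byte below 128 stands alone; a byte of 128 or more takes the next byte with it), not a full decoder, and `split_gb2312` is a hypothetical name.

```python
def split_gb2312(data):
    """Split GB-2312 bytes into one-byte ASCII units and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Below 128: standard ASCII, one byte on its own.
            units.append(data[i:i + 1])
            i += 1
        else:
            # 128 or above: a lead byte; the next byte belongs to it.
            units.append(data[i:i + 2])
            i += 2
    return units


# "0" followed by the two characters from the example: 30 C1 AC CD A8.
raw = bytes.fromhex("30 C1 AC CD A8")
units = split_gb2312(raw)
decoded = [u.decode("gb2312") for u in units]

print([u.hex().upper() for u in units])
print(decoded)
```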
Later, so that two-byte Unicode could pass correctly through existing systems that process single bytes, UTF-8 was introduced. It encodes Unicode in a manner similar to MBCS. Note that UTF-8 is an encoding, and that it belongs to the Unicode character set. The Unicode character set has several encodings, whereas ASCII has only one, and most MBCS character sets (including GB-2312) have only one. For example, the UTF-16 (little endian) encoding of "连通" is DE 8F 1A 90, and its UTF-8 encoding is E8 BF 9E E9 80 9A.

Finally, when software opens a text file, the first thing it has to do is decide which character set, and which encoding of that character set, the text was saved in. Software has three ways to determine this. The most standard way is to examine the first few bytes of the text, as in the following table:

Opening bytes    Charset/encoding
EF BB BF         UTF-8
FF FE            UTF-16/UCS-2, little endian
FE FF            UTF-16/UCS-2, big endian
FF FE 00 00      UTF-32/UCS-4, little endian
00 00 FE FF      UTF-32/UCS-4, big endian
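Checking the opening bytes can be sketched in Python with the marker constants from the standard `codecs` module. The function name `sniff_bom` is hypothetical. The one design point worth noting is the order of the checks: the UTF-32 little-endian marker FF FE 00 00 begins with the UTF-16 little-endian marker FF FE, so the longer markers have to be tested first.

```python
import codecs

# Longer markers first, because FF FE 00 00 also starts with FF FE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding implied by the opening bytes, or None if no marker."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(bytes.fromhex("EF BB BF E8 BF 9E")))
print(sniff_bom(bytes.fromhex("FF FE DE 8F")))
print(sniff_bom(bytes.fromhex("C1 AC CD A8")))
```

A result of None does not tell the software what the encoding is, only that this method cannot decide.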

For example, with the marker inserted, the UTF-16 (little endian) and UTF-8 encodings of "连通" are FF FE DE 8F 1A 90 and EF BB BF E8 BF 9E E9 80 9A respectively.
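These byte sequences can be checked with a short Python sketch that prepends the marker to each encoding of the word and prints the result as hex. The variable names are chosen for illustration only.

```python
import codecs

word = "连通"

# Marker followed by the encoded characters, as a file on disk would hold them.
utf16_le_with_bom = codecs.BOM_UTF16_LE + word.encode("utf-16-le")
utf8_with_bom = codecs.BOM_UTF8 + word.encode("utf-8")

print(utf16_le_with_bom.hex(" ").upper())  # FF FE DE 8F 1A 90
print(utf8_with_bom.hex(" ").upper())      # EF BB BF E8 BF 9E E9 80 9A
```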

However, MBCS text has no such marker at the beginning, and, worse, some early, poorly designed software does not insert the marker when saving Unicode text either. Software therefore cannot rely on this approach alone. In that case it can fall back on a safer way to determine the character set and encoding: pop up a dialog box and ask the user. For example, if you drag the "连通" file into MS Word, Word pops up such a dialog box.

If the software does not want to bother the user, or asking is inconvenient, it can only "guess": it infers the likely character set from the characteristics of the whole text, and the guess may well be wrong. That is what happens when you open the "连通" file in Notepad.

We can prove this. Type "连通" in Notepad and choose "Save As"; the last drop-down box shows "ANSI", so save the file that way. Now open the file again: the text is garbled. Click "File" > "Save As" once more, and the last drop-down box now shows "UTF-8", which means Notepad believes the text it has open is UTF-8 encoded, even though we just saved it with the ANSI character set. In other words, Notepad guessed the character set of the "连通" file and decided that it looked more like UTF-8. That is because the GB-2312 bytes of "连通" happen to look like a UTF-8 encoding. This is a coincidence and does not happen with all text. If you use Notepad's "Open" dialog and select ANSI in the last drop-down box, the file displays correctly. Conversely, if you had saved the file as UTF-8 in the first place, opening it directly would cause no problem.
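The coincidence can be made concrete. In UTF-8, a two-byte character has the bit pattern 110xxxxx 10xxxxxx, and the GB-2312 bytes C1 AC CD A8 happen to fit that pattern twice. The Python sketch below implements a deliberately naive structural check under the hypothetical name `looks_like_utf8`. It is an assumption about how a loose guesser might behave, not Notepad's actual algorithm; a strict UTF-8 decoder, such as Python's, rejects these bytes.

```python
def looks_like_utf8(data):
    """Naive check: does every byte fit the UTF-8 lead/continuation bit pattern?"""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1                    # single-byte (ASCII) character
            continue
        if b >> 5 == 0b110:
            need = 1                  # 110xxxxx: one continuation byte
        elif b >> 4 == 0b1110:
            need = 2                  # 1110xxxx: two continuation bytes
        elif b >> 3 == 0b11110:
            need = 3                  # 11110xxx: three continuation bytes
        else:
            return False
        tail = data[i + 1:i + 1 + need]
        # Every continuation byte must look like 10xxxxxx.
        if len(tail) != need or any(t >> 6 != 0b10 for t in tail):
            return False
        i += 1 + need
    return True


gb_bytes = "连通".encode("gb2312")             # C1 AC CD A8
print(looks_like_utf8(gb_bytes))               # the pattern fits by coincidence
print(looks_like_utf8("中国".encode("gbk")))   # most GB text does not fit

# A strict decoder is not fooled: C1 can never start a valid UTF-8 sequence.
try:
    gb_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```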

If you drag the "连通" file into MS Word, Word also thinks it is a UTF-8 encoded file but cannot be sure, so it pops up a dialog box to ask the user. Select "Simplified Chinese (GB2312)" and the file opens normally. Notepad simplifies this step, which is in keeping with its positioning as a lightweight program.
