No Fear of Mojibake: Computer Character Encodings Explained in Detail

Source: Internet
Author: User

You download a document, open it, and it's all mojibake; it's enough to drive you crazy... As everyone knows, character encoding is the curse behind this. ASCII, ANSI, GB18030, Unicode, UTF-8, UTF-8 with BOM, UTF-8 without BOM, UTF-16, UTF-16LE, UTF-16BE... who can keep this pile straight? I've heard that UTF-8 is Unicode, so why does the save dialog in Windows Notepad offer UTF-8 and Unicode as two separate options?! How does software decide which encoding a file uses? Why does it sometimes guess wrong? Let me go through these one by one.

The world did not start out with character encodings. Once we had computers, we needed to record text using 0s and 1s, and so characters came to be encoded.

==ASCII==

"Hello Guokr" (decimal) represented by ASCII encoding: 72 101 108 111 32 71 117 111 75 114

ASCII is the most basic encoding. It defines characters for the values 0–127, covering the basic English letters and punctuation. It cannot represent Chinese. In ASCII-encoded text every byte is in the range 0–127; if any byte is greater than 127, the text is definitely not ASCII.
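As a quick sanity check, here is a minimal Python sketch (my own illustration, not from the original article) that prints the ASCII byte values of the string above:

```python
# Minimal sketch: encode a string as ASCII and print the byte values.
text = "Hello Guokr"
data = text.encode("ascii")   # fails if any character is outside 0-127
print(list(data))             # [72, 101, 108, 108, 111, 32, 71, 117, 111, 107, 114]

# Chinese cannot be represented in ASCII:
try:
    "你好".encode("ascii")
except UnicodeEncodeError as e:
    print("not ASCII:", e)
```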

==GB*/ANSI==

To record and display Chinese on computers, China created the GB series of encodings, which define codes for Chinese characters and punctuation. Under a GB encoding, a byte in the range 0–127 means exactly what it means in ASCII; otherwise that byte and the byte after it together form one Chinese character (or another character defined by the GB encoding). GB encodings are therefore backward compatible with ASCII: if every character in a piece of GB-encoded text is also defined in ASCII, the bytes are identical to plain ASCII. The early GB encodings covered fewer than 10,000 Chinese characters, enough for daily use but missing some rare characters, which were added in later versions. The first GB encoding was GB2312, then came GBK, and the newest is GB18030, which adds some minority-language scripts and encodes some rare characters with 4 bytes. Each revision keeps every code assigned by the previous version, so every new version is backward compatible.
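A small Python sketch of these properties (my own illustration, not from the article): ASCII characters keep their single-byte codes under GB encodings, common Chinese characters take 2 bytes, and GB18030 uses 4 bytes for some rare characters.

```python
# GB-series encodings are backward compatible with ASCII.
print("Guokr".encode("gb18030"))      # b'Guokr' -- identical to ASCII
print("你好".encode("gb2312"))         # 2 bytes per common Chinese character
print(len("你好".encode("gb18030")))   # 4 (two characters x 2 bytes)

# Some rare characters exist only in GB18030 and take 4 bytes there.
rare = "\U00020000"                    # a CJK Extension B character
print(len(rare.encode("gb18030")))     # 4
try:
    rare.encode("gb2312")
except UnicodeEncodeError:
    print("not in GB2312")
```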

Similarly, Japanese, Korean, and other languages around the world have their own encodings (wherever ASCII is not enough). They work much like the GB encodings: they are compatible with ASCII and use two bytes per character. Microsoft collectively calls all of these national encodings "ANSI". So even when we know a file is "ANSI", we still need to know which language it is in before we can decode it, because the encodings all conflict with one another. You also cannot use a single ANSI encoding to mix, say, Chinese and Korean characters in one file.

Wait, I've misled you. Actually, Microsoft misled everyone. Strictly speaking, ANSI is not a character encoding at all; it is a non-profit standards organization in the United States (the American National Standards Institute). It does a great deal of standards work, including the C language standard ANSI C, as well as the "code pages" that correspond to national text encodings. The simplified Chinese GB encoding corresponds to code page 936, so GB encoding is also called "ANSI code page 936", and from there all the national encodings came to be lumped together as "ANSI". This is an absurd historical accident; it's as if I published a ranking of domestic universities based on the probability of finding insects in the cafeteria food, and from then on all domestic universities were collectively called "butter cats". Still, for this messy pile of mutually conflicting national encodings, having a collective name is convenient (for which I suppose we should thank Microsoft), so the discussion below will keep using "ANSI".
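To see why "ANSI" by itself is not enough information, here is a small illustrative sketch (mine, not the article's): the same bytes decode to different text depending on which code page you assume.

```python
# The same "ANSI" bytes mean different things under different code pages.
data = "中文".encode("gbk")              # code page 936 (simplified Chinese)
print(data)                              # b'\xd6\xd0\xce\xc4'

print(data.decode("gbk"))                         # 中文 -- correct
print(data.decode("euc_kr", errors="replace"))    # nonsense under a Korean code page
print(data.decode("cp1252", errors="replace"))    # ÖÐÎÄ under a Western code page
```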

==Unicode/UTF/UCS==

To resolve these conflicts and promote world peace, we got Unicode (union code?), an encoding that gathers every character from every language in the world so they can all coexist... well, wait. Unicode is a character encoding, but strictly speaking it isn't comparable to GB18030: it only assigns a single integer (a code point) to each character (currently more than 100,000 characters, with 0–127 matching ASCII exactly); it does not define how that integer becomes bytes. So if you just tell me "this data is Unicode", sorry, I still don't know how to decode it, because there is more than one way to turn code points into a byte stream. Those formats are called Unicode Transformation Formats, abbreviated UTF.
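In Python, for instance, you can look at a character's Unicode code point directly, separate from any byte encoding (a sketch of mine):

```python
# A Unicode code point is just an integer; it says nothing about bytes.
for ch in "A中😀":
    print(ch, hex(ord(ch)))
# A   0x41      -- same value as in ASCII
# 中  0x4e2d
# 😀  0x1f600   -- beyond what two bytes can hold

print(chr(0x4E2D))   # 中 -- from code point back to character
```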

So there is a whole pile of UTFs: UTF-8, UTF-8 with BOM, UTF-8 without BOM, UTF-16, UTF-16LE, UTF-16BE... plus the rarely seen UTF-32, and the older UCS-2 and UCS-4... _(:з」∠)_

Wait, why does Windows Notepad's save dialog have an option called "Unicode"? More on that later.

First, the most common one: UTF-8. It encodes each character as 1 to 4 bytes, and single-byte characters are identical to ASCII, so UTF-8 is backward compatible with ASCII. As with the ANSI encodings, the first byte of a character determines how many bytes belong together in that character. Most Chinese characters take 3 bytes in UTF-8, and some rare ones take 4.
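A quick sketch (mine) showing the variable byte lengths and the ASCII compatibility:

```python
# UTF-8: 1-4 bytes per character; single-byte characters are plain ASCII.
for ch in ["A", "é", "中", "😀"]:
    b = ch.encode("utf-8")
    print(ch, len(b), b.hex(" "))
# A  1  41
# é  2  c3 a9
# 中 3  e4 b8 ad
# 😀 4  f0 9f 98 80

print("Hello".encode("utf-8") == "Hello".encode("ascii"))   # True
```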

Now we meet the first encoding that is not compatible with ASCII: UTF-16. UTF-16 works in units of 2 bytes, and each character consists of 1 or 2 units, so a character may be 2 or 4 bytes; even the most common English letters take two bytes. Most Chinese characters are also 2 bytes, and a few rare ones are 4. On top of that, in UTF-16 the order of the two bytes inside a unit is not fixed. Anyone who has studied computer architecture knows that a computer can store an integer in two byte orders: low byte first, or the reverse. For example, the integer 260 stored in two bytes can be:

Low byte first (little-endian): 04 01 (260 = 4 + 256 × 1)

High byte first (big-endian): 01 04 (260 = 256 × 1 + 4)

The low-byte-first form of UTF-16 is called UTF-16LE, and the high-byte-first form UTF-16BE. Most computer systems today store integers little-endian, so when nothing is declared, UTF-16 is usually taken to be LE.
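The same 260 example, and the two UTF-16 byte orders, in a short Python sketch (mine):

```python
# Byte order: the integer 260 in two bytes.
print((260).to_bytes(2, "little").hex(" "))   # 04 01  (little-endian)
print((260).to_bytes(2, "big").hex(" "))      # 01 04  (big-endian)

# The same character under the two UTF-16 flavors.
print("中".encode("utf-16le").hex(" "))        # 2d 4e
print("中".encode("utf-16be").hex(" "))        # 4e 2d
```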

In its early days Unicode did not contain that many characters, and two bytes were enough to represent all of them, so there was a UTF fixed at two bytes per character, called UCS-2. The two-byte portion of UTF-16 is identical to UCS-2, so UTF-16 is backward compatible with UCS-2. UCS-2 likewise comes in LE and BE flavors. The thing that Windows Notepad (and other parts of Windows) calls "Unicode" is actually UTF-16LE on modern Windows, and UCS-2LE on Windows XP and earlier. Microsoft (Microsoft again) is the chief culprit behind the confusion around the word "Unicode". Microsoft, do you realize how many people you have led into this pit?
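The difference shows up with characters beyond the old two-byte range: UTF-16 spends two units (4 bytes) on them, which UCS-2 simply cannot represent. A sketch (mine):

```python
# A BMP character: 2 bytes in UTF-16, the same as UCS-2 would give.
print("中".encode("utf-16le").hex(" "))    # 2d 4e

# A character outside the old 2-byte range needs a surrogate pair: 4 bytes.
print("😀".encode("utf-16le").hex(" "))    # 3d d8 00 de
print(len("😀".encode("utf-16le")))        # 4
```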

In addition, UTF-32 and UCS-4 use a fixed 4 bytes per character, and likewise come in LE and BE flavors.

We're not done yet. With so many encodings, how does software know which one a file uses when it opens it? So someone (I'm not sure who to blame) came up with the BOM: a few fixed bytes placed at the very start of a text file or character stream as a marker. These bytes are guaranteed not to correspond to any character, so a program only has to read the beginning of the file to hear it announce:

EF BB BF: I'm UTF-8.

FF FE: I'm UTF-16LE.

FE FF: I'm UTF-16BE.

(no BOM, straight into the content): you guess?

Right, with no BOM you can only guess. When reading the file, the software can try every encoding it knows and see which one "looks right". Also, BOMs exist only for the Unicode family of encodings; ANSI files never have one. Obviously, guessing without a BOM can go wrong. There is a famous joke online: open Windows Notepad, type the two characters 联通 ("Unicom"), save, close, and reopen, and you get black squares. Notepad saved the two characters as ANSI (GB18030), and the GB18030 bytes for those two characters happen to look like UTF-16, so on reopening they get decoded as UTF-16...
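Here is a rough sketch (mine) of BOM-based detection, along the lines of what such software does; real detectors also fall back to guessing when no BOM is present.

```python
# Rough BOM sniffing: check the first bytes of a file against known marks.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_encoding(path):
    with open(path, "rb") as f:
        head = f.read(3)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None   # no BOM: you have to guess

# Example: a file written with a UTF-8 BOM is recognized.
with open("demo.txt", "w", encoding="utf-8-sig") as f:
    f.write("联通")
print(sniff_encoding("demo.txt"))   # utf-8-sig
```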

The BOM sounds nice, but it is actually an annoying design, because it conflicts with many protocols and specifications. That is a digression, though.

So now "UTF-8 with BOM" and "UTF-16 without BOM" should make sense. But wait: if the BOM is not mentioned at all, is there one or not? That is another messy question. Windows software generally writes a BOM by default, while other systems generally do not. Windows probably defaults to a BOM because it so often has to coexist with ANSI and relies on the BOM to avoid misdetection.

==Who Is Orthodox==

With so many encodings, if you are only writing articles it is not a big deal: when a document opens as mojibake, close it and reopen it with a different program or a different encoding. Writing programs is less forgiving. Development environments and compilers differ in which encodings they support, and if you choose badly, a program that works for you may not build or run on someone else's machine. If I am writing cross-platform code that contains Chinese (in comments, say), which encoding should I use? Below are the encodings I know each common development environment / compiler supports ("all" = ANSI, UTF-8 with and without BOM, UTF-16LE with and without BOM):

    • Visual Studio: all; saves as ANSI by default.
    • VC compiler: all except UTF-8 without BOM (which it treats as ANSI).
    • Windows Notepad: all; saves as ANSI by default; cannot save without a BOM.
    • Xcode: only UTF-8 without BOM is supported.
    • GCC: all.
    • Vim: all; saves in the system default encoding, usually UTF-8 without BOM.
    • Eclipse: all; default save encoding unknown; on the Mac it is, unbelievably, ANSI (we have a traitor).

ANSI does not travel across borders: a document written in GB is guaranteed to turn into mojibake when taken to Korea. Even on a simplified Chinese system, ANSI is often misidentified; Eclipse sometimes (not always) opens ANSI files as mojibake, because ANSI has no distinctive signature. Xcode and the Mac text editor show mojibake for ANSI right away because they simply do not support it. ANSI also tolerates errors poorly: one bad byte can garble all of the text that follows. To keep your data safe from Eclipse and its kind, stay away from ANSI.
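The error-tolerance difference is easy to demonstrate (a sketch of mine): delete one byte from GBK text and everything after it shifts out of frame, while the same damage to UTF-8 costs at most a character or so.

```python
text = "汉字编码测试"

# GBK: dropping one byte throws the rest of the 2-byte pairs out of frame.
gbk = bytearray(text.encode("gbk"))
del gbk[1]
print(bytes(gbk).decode("gbk", errors="replace"))     # garbage after the damage

# UTF-8: lead bytes and continuation bytes are distinguishable,
# so the decoder resynchronizes right after the damaged character.
utf8 = bytearray(text.encode("utf-8"))
del utf8[1]
print(bytes(utf8).decode("utf-8", errors="replace"))  # only the first character is lost
```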

UTF-16 is incompatible with ASCII and with C's string-handling library functions (its byte stream is full of zero bytes, which C strings treat as terminators); apart from Windows, which loves it, other systems hate it. The BOM conflicts with many protocols and specifications and is resisted by plenty of software; it too is really only common on Windows, which even treats it as the orthodox form (sigh).

Putting it all together: for cross-platform development, please use UTF-8 without BOM. It is the most widespread encoding, the default of many software systems, and the encoding of the web page you are reading right now. Its byte patterns are distinctive; apart from the VC compiler and various Microsoft software, I have not found anything that misidentifies it. It also has a well-designed error-tolerance property: one bad byte corrupts at most one character. Please set your development environment's default save encoding to UTF-8 without BOM, convert files in other encodings over to it, and mojibake will be gone for good. As for Notepad and other editors that do not support it, don't let them touch your files. And if the VC compiler insists on being difficult, just let it sulk. _(:з」∠)_

(After testing: VS2010 can open UTF-8 without BOM normally, but it may fail to compile it, reporting strange compilation errors.)
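Converting existing files is straightforward; here is a minimal sketch of mine (the source encoding has to be known or guessed, of course, and the file name below is just an example):

```python
# Re-save a file as UTF-8 without BOM.
def convert_to_utf8(path, source_encoding):
    with open(path, "r", encoding=source_encoding) as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:   # Python writes no BOM for "utf-8"
        f.write(text)

# e.g. convert_to_utf8("legacy.c", "gb18030")
# Tip: source_encoding="utf-8-sig" reads UTF-8 and strips a BOM if one is present.
```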

Why do programmers love to bash Microsoft so much? Because Microsoft's designers always insist on doing things differently from everyone else...

A final word:

In an era when machines can understand human speech, it is absurd that we still have to worry about something as low-level as character encoding... The solution is simple: a few big companies sit down together and settle on one encoding to use from now on. And that encoding has already emerged, mature and well supported: UTF-8. I admire Apple here: they decisively cut ties with ANSI and the pile of inconsistent encodings, and their software supports only UTF-8, which is what being responsible to users looks like. Microsoft, meanwhile, still clings to ANSI to this day; from this small thing you can see how attached it is to its past achievements, and how unwilling it is to reinvent itself.

Reprint: http://www.guokr.com/blog/763017/
