As programmers, and especially as Chinese programmers, almost all of us have run into the "garbled text" problem, and it has given us plenty of headaches. The root cause of garbled text is that encoding and decoding use different, incompatible "standards"; in China it usually shows up when encoding and decoding Chinese text.
The encodings we commonly use include Unicode, GBK, ASCII, UTF8, UTF16, ISO8859-1, and so on. Once you understand the relationships between these encodings, it is not hard to see why "garbled text" appears and how to fix it.
A character set encoding is really just a one-to-one correspondence between characters (English letters, special symbols, control characters, digits, Chinese characters, and so on) and numbers in the computer (stored in binary): the number represents the character, and storing the character means storing the number. For example, 'a' corresponds to the number 97. So an encoding is very simple to understand: it is a correspondence between characters and numbers.
ASCII encoding:
The computer first appeared in the United States, so early American encodings only needed to cover the 26 English letters and the usual symbols. That correspondence is the ASCII (American Standard Code for Information Interchange) code. Standard ASCII uses a 7-bit binary number to represent all uppercase and lowercase letters, the digits 0 through 9, punctuation, and the special control characters used in American English. This can represent 2^7 = 128 characters. The highest bit of standard ASCII is always 0 and is not used.
Printing the ASCII code of an English character in Java:
public class TestCode {
    public static void main(String[] args) throws Exception {
        int code = 'a';
        System.out.println(code);
    }
}
Output:
97
ISO8859-1 encoding:
As computers spread, they came to be used all over the world. Different languages put new demands on character encoding, and the 128 characters of the original ASCII became seriously insufficient. What to do? Well, ASCII only uses 7 bits of each byte, so isn't there 1 bit left over? Hurry up and use it! So the encoding was extended to 8 bits, i.e. 2^8 = 256 characters, and this is ISO8859-1. The extension preserves compatibility with ASCII: an ISO8859-1 code with the highest bit 0 is identical to the ASCII code.
Printing an ISO8859-1 character in Java:
public class TestCode {
    public static void main(String[] args) throws Exception {
        char code = 0xA7;
        System.out.println(code);
    }
}
Output:
§
GBK encoding:
When the computer entered China, people had a headache: there are more than 6,000 commonly used Chinese characters, so encoding them in a single byte like ASCII is nowhere near enough. But that did not stop anyone. A standard was set directly: bytes below 127 keep their original meanings (staying compatible with ASCII), while two bytes that are both above 127, taken together, represent one Chinese character. In this way more than 7,000 simplified Chinese characters could be encoded. In addition, the punctuation, digits, and letters already present in ASCII were given two-byte encodings of their own, which are the so-called "full-width" characters. This encoding is GB2312.
But there are far too many Chinese characters; GB2312 was still not enough, and some less common characters could not be displayed. So the potential of GB2312 had to be dug into further: the scheme was extended to require only that the first byte be greater than 127, regardless of the value of the second byte. This extended encoding scheme is called GBK.
Printing the GB2312 encoding of the Chinese word "你好" ("hello") in Java (GBK produces the same output):
public class TestCode {
    public static void main(String[] args) throws Exception {
        String s = "你好";
        byte[] code = s.getBytes("GB2312");
        for (byte b : code) {
            System.out.print(Integer.toHexString(b & 0xFF) + " ");
        }
    }
}
Output:
c4 e3 ba c3
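Incidentally, this correspondence is exactly where garbled text comes from: bytes produced with one charset get decoded with another. A minimal sketch (the charset names are the standard ones accepted by the JDK):

public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        byte[] gb = "你好".getBytes("GB2312");            // c4 e3 ba c3
        // Decoding the same bytes with an incompatible charset garbles them:
        System.out.println(new String(gb, "ISO-8859-1")); // prints ÄãºÃ
        // Decoding with the charset that produced them restores the text:
        System.out.println(new String(gb, "GB2312"));     // prints 你好
    }
}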
Unicode encoding:
China built the GBK code, and other countries likewise needed to display their own scripts, so every country devised its own encoding standard. The result: nobody could read anybody else's encoding, and nothing was compatible. This would not do, so the ISO (International Organization for Standardization) had to step in and say: "Stop making your own encodings; I'll give you all a unified one!" Thus ISO produced a globally uniform character set encoding scheme called UCS (Universal Character Set), commonly known as Unicode.
The Unicode standard was first released in 1991. The widely applied version is UCS-2, i.e. characters encoded in two bytes. In theory this can encode 2^16 = 65,536 characters, basically enough to meet the needs of all kinds of languages.
UTF8 and UTF16 encodings:
But wait: Unicode already solves the internationalization of encodings perfectly, so what on earth are UTF8 and UTF16, and what problem do they solve?
As we said above, an encoding is just a correspondence between characters and numbers. That is a purely mathematical matter and has nothing to do with computers, storage, or networks. Unicode is exactly such a correspondence; it says nothing about how the numbers are stored or transmitted. Take a look at the following example:
Suppose a character has the Unicode code 0xABCD, which is two bytes. When it is stored, which byte goes first? Which byte is transmitted first over the network? And when the computer reads 0xABCD from a file, how does it know whether this is two single-byte (e.g. ISO8859-1) characters or one Unicode character?
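To make that last question concrete, here is a small sketch: the very same two bytes decoded as two ISO8859-1 characters versus one big-endian 16-bit Unicode character.

public class ByteAmbiguity {
    public static void main(String[] args) throws Exception {
        byte[] data = { (byte) 0xAB, (byte) 0xCD };
        // Read as two single-byte characters:
        System.out.println(new String(data, "ISO-8859-1")); // prints «Í
        // Read as one 16-bit character, U+ABCD:
        System.out.println(new String(data, "UTF-16BE"));
    }
}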
Therefore a unified storage and transmission format is needed to represent Unicode codes. This unified implementation is called the Unicode Transformation Format (UTF for short), and this is where the UTF8 and UTF16 encodings come from.
UTF16 corresponds exactly to the 16-bit Unicode code. The catch is that different machines understand byte order differently. An ordinary x86 PC is little-endian: it stores and reads the low byte first, so if the 0xABCD above were written out in the order you see it, such a machine would read it back as 0xCDAB. A big-endian machine (such as the older PowerPC Macs) stores the high byte first and reads 0xABCD. The same bytes thus end up mapped to different characters in the Unicode table.
Therefore UTF16 introduces the concepts of big-endian (UTF-16 BE), little-endian (UTF-16 LE), and the BOM (byte order mark). If you type a few Chinese characters in Notepad on Windows, save them in the "Unicode" format, and then open the file with a hex viewer, you can see that the first two bytes are FF FE (0xFFFE deliberately corresponds to no character in Unicode); they mark the file as using little-endian storage (the Windows platform defaults to little-endian).
(Figure: hex dump of the Chinese word "你好" saved in Unicode format on Windows 7.)
Printing the UTF16 encoding of the word "你好" in Java:
public class TestCode {
    public static void main(String[] args) throws Exception {
        String s = "你好";
        byte[] code = s.getBytes("UTF-16");
        for (byte b : code) {
            System.out.print(Integer.toHexString(b & 0xFF) + " ");
        }
    }
}
Output:
fe ff 4f 60 59 7d
You can see that by default Java outputs big-endian UTF16, with the BOM 0xFEFF first.
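The JDK's guaranteed standard charsets also include the explicit endian variants, so you can see the effect of byte order and the BOM directly. A small sketch:

public class Utf16Variants {
    static void dump(byte[] bytes) {
        for (byte b : bytes) {
            System.out.print(Integer.toHexString(b & 0xFF) + " ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String s = "你";                // U+4F60
        dump(s.getBytes("UTF-16"));    // fe ff 4f 60  (BOM + big-endian)
        dump(s.getBytes("UTF-16BE"));  // 4f 60        (no BOM)
        dump(s.getBytes("UTF-16LE"));  // 60 4f        (no BOM)
    }
}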
Since Unicode uniformly encodes characters in 16 bits, imagine an English text stored as UTF16: it takes twice the space of ASCII storage (the high byte of every English character's Unicode code is 0). Such blatant waste is hard to accept, and so UTF8 was born. UTF8 is a variable-length encoding that uses a different number of bytes depending on the Unicode code value. Then the question comes again: with a variable-length scheme, how do you know how many bytes make up one character's encoding? The common approach in computing is to use flag bits, much like the partitioning of IP address ranges. Specifically:
0xxxxxxx: a byte starting with 0 (the x's can be any bits) stands on its own as a one-byte unit; this is exactly the same as ASCII.
110xxxxx 10xxxxxx: bytes in this format form a two-byte unit.
1110xxxx 10xxxxxx 10xxxxxx: bytes in this format form a three-byte unit. (A hand-rolled encoder following these rules is sketched below.)
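As a sketch of how these rules work in practice, the following toy encoder (covering only code points up to U+FFFF, i.e. the UCS-2 range discussed here) applies exactly the three patterns above:

public class Utf8ByHand {
    // Encode one code point (<= U+FFFF) following the bit patterns above.
    static byte[] encode(int cp) {
        if (cp < 0x80) {                      // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {              // 110xxxxx 10xxxxxx
            return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                (byte) (0x80 | (cp & 0x3F)) };
        } else {                              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        for (byte b : encode(0x4F60)) {       // "你" is U+4F60
            System.out.print(Integer.toHexString(b & 0xFF) + " "); // e4 bd a0
        }
    }
}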
Printing the UTF8 encoding of "你好" in Java:
public class TestCode {
    public static void main(String[] args) throws Exception {
        String s = "你好";
        byte[] code = s.getBytes("UTF-8");
        for (byte b : code) {
            System.out.print(Integer.toHexString(b & 0xFF) + " ");
        }
    }
}
Output:
e4 bd a0 e5 a5 bd
We can check this against the patterns above. The first byte of "你" is 0xE4, i.e. 11100100 in binary; its high four bits are 1110, so this is a three-byte encoding. The system therefore knows to read three bytes as one unit, combine them back into a Unicode number, and look up the character "你". The character "好" works the same way.
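Going the other way is just as mechanical. A sketch that reassembles the three bytes of "你" into its Unicode number by stripping the flag bits:

public class Utf8Decode {
    public static void main(String[] args) {
        int b1 = 0xE4, b2 = 0xBD, b3 = 0xA0;
        // Drop the 1110 / 10 / 10 flag bits and concatenate the x bits:
        int cp = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
        System.out.println(Integer.toHexString(cp)); // 4f60
        System.out.println((char) cp);               // 你
    }
}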
The development of Unicode:
Even so, encoding the world's characters in 16 bits eventually proved a stretch. So, starting with Unicode version 3.1, 16 auxiliary planes were added (effectively extending the Unicode code by 4 bits), raising the available space from about 65,000 to roughly 1.1 million characters. In plain words, several ranges were appended: the original Unicode range is 0x0000 ~ 0xFFFF, the first auxiliary plane is 0x10000 ~ 0x1FFFD, the second is 0x20000 ~ 0x2FFFD, and so on.
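Characters on these auxiliary planes no longer fit in one 16-bit unit; in Java, whose String is UTF-16 under the hood, they occupy two chars (a surrogate pair). A small sketch using U+20000 from the second auxiliary plane:

public class SupplementaryDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x20000));     // U+20000
        System.out.println(s.length());                        // 2: a surrogate pair
        System.out.println(s.codePointCount(0, s.length()));   // 1: one character
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 20000
    }
}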
The latest versions of the Unicode specification also define UCS-4, i.e. Unicode codes of four bytes. Analogous to UTF16 above, UCS-4 Unicode codes can be stored as UTF32, which again needs endianness variants and BOM information.
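For illustration, common JDKs also ship a UTF-32 charset (note this is an assumption about the runtime: UTF-32 is not among the few charsets the Java specification guarantees):

public class Utf32Demo {
    public static void main(String[] args) throws Exception {
        byte[] code = "你".getBytes("UTF-32BE"); // assumes the JDK provides UTF-32BE
        for (byte b : code) {
            System.out.print(Integer.toHexString(b & 0xFF) + " "); // 0 0 4f 60
        }
        // Four bytes per character, high bytes zero-padded.
    }
}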