Original article: http://www.cnblogs.com/nuaalfm/archive/2008/09/12/1290140.html
Joel Spolsky (the author of "Joel on Software"), formerly of Microsoft, once said that every software developer absolutely, positively must know at least the basics of Unicode and character sets, with no exceptions. I have often been tripped up by character set conversion and related problems, so this time I made up my mind to sort it all out.
I. ASCII code
We know that inside a computer all information is ultimately represented as a binary string. Each binary bit has two states, 0 and 1, so eight bits can be combined into 256 states; this unit is called a byte. In other words, a single byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.
In the 1960s, the United States developed a character code that defines the relationship between English characters and binary values. It is called ASCII and is still in use today. ASCII defines 128 characters in total. For example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printable control characters, codes 0-31 and 127) occupy only the low seven bits of a byte; the highest bit is always 0.
In C#, if you want to see the ASCII code of a letter, you can use the Encoding class. The code is as follows:
using System.Text;
string s = "a";
byte[] ascii = Encoding.ASCII.GetBytes(s);
In the debugger we can see that ascii contains the value 97, that is, the ASCII code of a is 97 (binary 1100001).
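The same check can be done without a debugger. The following is a minimal sketch; the class name AsciiDemo and its methods are illustrative and not part of the original article.

```csharp
using System;
using System.Text;

static class AsciiDemo
{
    // Returns the ASCII code of a character as a number.
    public static byte Code(char c)
    {
        return Encoding.ASCII.GetBytes(c.ToString())[0];
    }

    // Returns the same code as an 8-bit binary string, padded with leading zeros.
    public static string ToBinary(char c)
    {
        return Convert.ToString(Code(c), 2).PadLeft(8, '0');
    }
}
```

For the characters mentioned above, AsciiDemo.Code(' ') is 32, AsciiDemo.Code('A') is 65 and AsciiDemo.ToBinary('a') is 01100001. In every case the highest bit is 0.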
II. Non-ASCII Encoding
128 symbols are enough to encode English, but not enough for other languages. For example, French letters can carry an accent mark, which ASCII cannot represent. As a result, some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encoding systems used by these European countries can represent at most 256 symbols.
However, this creates a new problem. Different countries have different letters, so even though they all use 256-symbol encodings, the same value stands for different letters. For example, 130 represents é in the French encoding, but the letter gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, 0-127 represent the same symbols; only 128-255 differ.
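The ambiguity of byte 130 can be reproduced in code. The sketch below assumes .NET Core 3.0 or later (including .NET 5+), where legacy code pages must be registered first, and it assumes the DOS code pages 437 (Western), 862 (Hebrew) and 866 (Russian) as stand-ins for the national encodings mentioned above; the article does not name specific code pages. The class name CodePageDemo is illustrative.

```csharp
using System;
using System.Text;

static class CodePageDemo
{
    static CodePageDemo()
    {
        // Legacy code pages are not available by default on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Interprets a single byte under the given code page number.
    public static string Decode(byte value, int codePage)
    {
        return Encoding.GetEncoding(codePage).GetString(new[] { value });
    }
}
```

With these assumptions, CodePageDemo.Decode(130, 437) gives é, CodePageDemo.Decode(130, 862) gives ג (gimel) and CodePageDemo.Decode(130, 866) gives the Cyrillic letter В, while a value below 128 such as 65 decodes to A under all three.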
Asian languages use far more characters; there are about 100,000 Chinese characters alone. A single byte can represent only 256 symbols, which is clearly not enough, so multiple bytes must be used for one symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65,536 symbols. In C#, if you want to see the GB2312 encoding of a Chinese character, you can use the following code:
// On .NET Core / .NET 5+, first call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
string s = "梁";
System.Text.Encoding gb2312 = System.Text.Encoding.GetEncoding("gb2312");
byte[] gb = gb2312.GetBytes(s);
At this point gb contains two numbers: 193 (11000001) and 186 (10111010).
III. Unicode
As mentioned above, there are multiple encoding methods in the world. The same binary number can be interpreted as different symbols. Therefore, to open a text file, you must know its encoding method. Otherwise, garbled characters may occur when you use an incorrect encoding method. Why do emails often contain garbled characters? It is because the sender and receiver use different encoding methods.
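Garbled text of this kind is easy to produce on purpose. The following minimal sketch encodes a string as UTF-8 and then deliberately decodes the bytes as ISO-8859-1; the class name MojibakeDemo is illustrative, and the choice of ISO-8859-1 as the "wrong" encoding is an assumption made for the example.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes the bytes with the wrong encoding.
    public static string Misread(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(utf8Bytes);
    }
}
```

MojibakeDemo.Misread("梁") returns three unrelated characters instead of one Chinese character, because each of the three UTF-8 bytes is interpreted as a separate symbol. Pure ASCII input such as "abc" survives unchanged, which is why the problem often goes unnoticed with English text.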
You can imagine that if there were a single encoding that included every symbol in the world and gave each symbol a unique code, the garbled-text problem would disappear. This is Unicode; as its name suggests, it is an encoding for all symbols.
Unicode is, of course, a very large set. Its current size can hold more than 1 million symbols, each with a different code. If you want to see the Unicode encoding of a Chinese character in C#, you can use the following code:
string s = "梁";
byte[] unicode = Encoding.Unicode.GetBytes(s);
At this point unicode contains two numbers: 129 (10000001) and 104 (01101000).
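The two bytes can be put back together to show the character's code. The sketch below relies on the fact that Encoding.Unicode in .NET is little-endian UTF-16, so the low byte comes first; the class name CodePointDemo and its methods are illustrative.

```csharp
using System;
using System.Text;

static class CodePointDemo
{
    // Formats the first code point of the text in the usual U+XXXX notation.
    public static string Describe(string text)
    {
        return "U+" + char.ConvertToUtf32(text, 0).ToString("X4");
    }

    // Combines the two bytes produced by Encoding.Unicode (low byte first)
    // into a single 16-bit value.
    public static int FromLittleEndian(byte[] utf16Bytes)
    {
        return utf16Bytes[0] | (utf16Bytes[1] << 8);
    }
}
```

For "梁", CodePointDemo.Describe returns U+6881, and applying FromLittleEndian to the bytes 129 and 104 gives the same value, hexadecimal 6881.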
IV. Unicode Problems
It should be noted that Unicode is only a set of symbols. It specifies the binary code of each symbol, but not how that binary code should be stored.
For example, the Unicode code of the Chinese character 梁 is hexadecimal 6881 (binary 110100010000001, 15 bits), so representing this symbol requires at least two bytes. Other, larger symbols may take 3 or 4 bytes, or even more.
There are two serious problems here. The first is: how can Unicode be distinguished from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? The second is that, as we already know, one byte is enough for an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would have to be preceded by two or three zero bytes, which is a huge waste of storage. Text files would grow to three or four times their original size, which is unacceptable.
The results were: 1) several storage methods for Unicode appeared, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not be widely adopted for a long time, until the Internet emerged.
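The different storage formats, and the waste described above, can be seen by encoding the same text several ways. This is a minimal sketch using the encodings built into .NET; the class name StorageFormsDemo is illustrative.

```csharp
using System;
using System.Text;

static class StorageFormsDemo
{
    // Returns the encoded bytes as a hexadecimal string such as "E6-A2-81".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

For the letter A, StorageFormsDemo.Hex gives 41 with Encoding.UTF8, 41-00 with Encoding.Unicode (UTF-16, little-endian), 00-41 with Encoding.BigEndianUnicode and 41-00-00-00 with Encoding.UTF32. For 梁 the same four encodings give E6-A2-81, 81-68, 68-81 and 81-68-00-00. One symbol, four different byte sequences.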
V. UTF-8
With the spread of the Internet, a unified encoding became a strong requirement. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat, the relationship is that UTF-8 is one of the ways Unicode is implemented.
The most important feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.
The UTF-8 encoding rules are very simple. There are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining seven bits hold the symbol's Unicode code. Therefore, for English letters, the UTF-8 encoding is identical to ASCII.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits hold the symbol's Unicode code.
Conversion between Unicode and UTF-8:
Unicode range (hexadecimal) | UTF-8 byte stream (binary)
U-00000000 - U-0000007F | 0xxxxxxx
U-00000080 - U-000007FF | 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The five- and six-byte forms come from the original UTF-8 design. Current UTF-8 (RFC 3629) stops at U+10FFFF, so in practice at most four bytes are used.
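The two rules and the first four rows of the table can be written out directly with bit operations. The following is a minimal sketch for illustration only; the class name Utf8Sketch is hypothetical, surrogate code points are not rejected, and real programs should use Encoding.UTF8 instead.

```csharp
using System;

static class Utf8Sketch
{
    // Encodes one Unicode code point (0..0x10FFFF) into UTF-8 bytes.
    public static byte[] Encode(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        if (codePoint <= 0x7F)        // 0xxxxxxx
            return new[] { (byte)codePoint };

        if (codePoint <= 0x7FF)       // 110xxxxx 10xxxxxx
            return new[]
            {
                (byte)(0xC0 | (codePoint >> 6)),
                (byte)(0x80 | (codePoint & 0x3F)),
            };

        if (codePoint <= 0xFFFF)      // 1110xxxx 10xxxxxx 10xxxxxx
            return new[]
            {
                (byte)(0xE0 | (codePoint >> 12)),
                (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
                (byte)(0x80 | (codePoint & 0x3F)),
            };

        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new[]
        {
            (byte)(0xF0 | (codePoint >> 18)),
            (byte)(0x80 | ((codePoint >> 12) & 0x3F)),
            (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
            (byte)(0x80 | (codePoint & 0x3F)),
        };
    }
}
```

Utf8Sketch.Encode(0x6881) returns the bytes 230, 162, 129, and Utf8Sketch.Encode(0x41) returns the single byte 65, the same as ASCII.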
For example, we use the following code:
string s = "梁";
byte[] unicode = Encoding.Unicode.GetBytes(s);
byte[] utf8 = Encoding.UTF8.GetBytes(s);
In the debugger, unicode contains 129, 104 and utf8 contains 230, 162, 129.
Encoding.Unicode is little-endian UTF-16, so the low byte is stored first: 129 is 81 in hexadecimal and 104 is 68 in hexadecimal, which means the Unicode code of 梁 is hexadecimal 6881, or binary 110100010000001. From the table above, 6881 falls in the third row (0800-FFFF), so the UTF-8 encoding of 梁 needs three bytes in the format "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last binary bit of 梁, fill in the x positions from back to front and pad the leftover positions with 0. This gives the UTF-8 encoding of 梁 as "111001101010001010000001"; converting every 8 bits to decimal gives 230, 162, 129, exactly the values in utf8.
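The worked example can also be run in reverse: strip the marker bits from each UTF-8 byte and concatenate what is left. This is a minimal sketch that handles only well-formed input of 1 to 4 bytes and does no validation of continuation bytes; the class name Utf8DecodeSketch is hypothetical.

```csharp
using System;

static class Utf8DecodeSketch
{
    // Decodes the first code point from a UTF-8 byte sequence.
    public static int DecodeFirst(byte[] bytes)
    {
        byte lead = bytes[0];
        int length;
        int codePoint;

        // The number of leading 1 bits in the first byte gives the length.
        if ((lead & 0x80) == 0x00) { length = 1; codePoint = lead; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; codePoint = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; codePoint = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; codePoint = lead & 0x07; }
        else throw new ArgumentException("Not a valid UTF-8 lead byte.");

        // Each following byte contributes its low six bits.
        for (int i = 1; i < length; i++)
            codePoint = (codePoint << 6) | (bytes[i] & 0x3F);

        return codePoint;
    }
}
```

Utf8DecodeSketch.DecodeFirst(new byte[] { 230, 162, 129 }) returns 26753, which is hexadecimal 6881.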
VI. Converting UTF-8 to GB2312 in C#
.NET strings in memory are Unicode (UTF-16), so a test program for this is not easy to write as a console application. Please write your own based on the following code:
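// Requires: using System.Text;
// On .NET Core / .NET 5+, register CodePagesEncodingProvider.Instance before asking for "gb2312".
string Utf8ToGb2312(string str)
{
    Encoding utf8 = Encoding.UTF8;
    Encoding gb2312 = Encoding.GetEncoding("gb2312");
    // Encode the in-memory string as UTF-8 bytes.
    byte[] utf8Bytes = utf8.GetBytes(str);
    // Re-encode the UTF-8 bytes as GB2312 bytes; this is the actual conversion.
    byte[] gb2312Bytes = Encoding.Convert(utf8, gb2312, utf8Bytes);
    // Decode the GB2312 bytes back into a .NET string.
    char[] gb2312Chars = new char[gb2312.GetCharCount(gb2312Bytes, 0, gb2312Bytes.Length)];
    gb2312.GetChars(gb2312Bytes, 0, gb2312Bytes.Length, gb2312Chars, 0);
    string gb2312Info = new string(gb2312Chars);
    return gb2312Info;
}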
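Because the function above returns a .NET string, the GB2312 bytes themselves are not visible in its result. When the goal is to write GB2312 data to a file or a network stream, it can be more direct to work with the byte arrays. The following is a minimal sketch assuming .NET Core 3.0 or later, where the code page provider must be registered; the class name TranscodeDemo is illustrative.

```csharp
using System;
using System.Text;

static class TranscodeDemo
{
    static TranscodeDemo()
    {
        // GB2312 is a legacy code page and must be registered on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Converts UTF-8 encoded bytes into GB2312 encoded bytes.
    public static byte[] Utf8BytesToGb2312Bytes(byte[] utf8Bytes)
    {
        Encoding gb2312 = Encoding.GetEncoding("gb2312");
        return Encoding.Convert(Encoding.UTF8, gb2312, utf8Bytes);
    }
}
```

Passing the UTF-8 bytes of 梁 (230, 162, 129) returns the two GB2312 bytes 193 and 186 seen earlier.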
VII. Advantages of UTF-8
UTF-8 is a universal encoding. If a user on an operating system set up for another language visits a GB2312-encoded website, a language pack has to be downloaded first, so for the sake of a site's universality UTF-8 is the better choice. By comparison, however, GB2312 produces less data than UTF-8 for the same Chinese text (two bytes per Chinese character instead of three).
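The size difference can be measured with GetByteCount. This is a minimal sketch assuming .NET Core 3.0 or later for the GB2312 code page; the class name ByteCountDemo is illustrative.

```csharp
using System;
using System.Text;

static class ByteCountDemo
{
    static ByteCountDemo()
    {
        // Needed so that "gb2312" can be resolved on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Returns how many bytes the text occupies in the named encoding.
    public static int Count(string encodingName, string text)
    {
        return Encoding.GetEncoding(encodingName).GetByteCount(text);
    }
}
```

For a string of four Chinese characters that all exist in GB2312, ByteCountDemo.Count("gb2312", text) is 8 and ByteCountDemo.Count("utf-8", text) is 12. For pure English text the two counts are equal, since both encodings use one byte per ASCII character.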
VIII. Garbled Text
If there is a string in memory, in a file, or in an email, you need to know which encoding it uses; otherwise it cannot be interpreted correctly or displayed to the user. If the encoding in use has no equivalent for some content, a small question mark "?" or a box is usually displayed instead. .NET strings in memory are Unicode, and ASP.NET programs use UTF-8 by default. When some of our strings appear garbled, the first thing to check is whether we are interpreting them with the wrong encoding.
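The question mark substitution can be observed directly, and .NET can also be asked to fail loudly instead of substituting. This is a minimal sketch using only built-in encodings; the class name FallbackDemo and its methods are illustrative.

```csharp
using System;
using System.Text;

static class FallbackDemo
{
    // Round-trips text through ASCII; characters with no ASCII equivalent
    // are silently replaced by '?'.
    public static string LossyAscii(string text)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));
    }

    // Reports whether every character of the text can be represented
    // in the named encoding, without any substitution.
    public static bool CanEncode(string text, string encodingName)
    {
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

FallbackDemo.LossyAscii("a梁") returns "a?", FallbackDemo.CanEncode("梁", "us-ascii") returns false, and FallbackDemo.CanEncode("梁", "utf-8") returns true. A check like this before saving or sending text is one way to catch data loss early rather than discovering question marks later.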