1. Character encoding history
This section draws on an article by a fellow cnblogs user: http://www.cnblogs.com/KevinYang/archive/2010/06/18/1760597.html
Computers were first invented to solve numerical computing problems, but people later found that they could do more, such as text processing. Because a computer only understands numbers, people must tell it which number represents which character: for example, 65 represents the letter 'A', 66 represents the letter 'B', and so on. The correspondence between characters and numbers must also be the same on every computer; otherwise the same number would display as different characters on different machines. The American National Standards Institute (ANSI) therefore published a standard that specifies a set of common characters and the number assigned to each one. This is the ASCII character set, also known as the ASCII code. At that time computers generally used the 8-bit byte as the smallest unit of storage and processing, and few characters were in use: the 26 English letters in upper and lower case, the digits, and other common symbols came to fewer than 100 characters. Seven bits were therefore enough to store and process ASCII efficiently, and the remaining bit was used as a parity bit in some communication systems.
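The character-to-number mapping and the spare eighth bit can be illustrated with a small C# sketch. The class name AsciiDemo and its methods are hypothetical names made up for this illustration, and even parity is assumed as one common convention; real communication systems differ.

```csharp
using System;

public static class AsciiDemo
{
    // Returns the numeric code of a character, e.g. 'A' -> 65.
    public static int CodeOf(char c) => c;

    // Packs a 7-bit ASCII code into one byte and sets the top bit when needed
    // so that the total number of 1 bits is even (even parity, assumed here).
    public static byte WithEvenParity(char c)
    {
        if (c > 127)
            throw new ArgumentOutOfRangeException(nameof(c), "Not a 7-bit ASCII character.");

        int code = c;
        int ones = 0;
        for (int bits = code; bits != 0; bits >>= 1)
            ones += bits & 1;

        // An odd count of 1 bits means bit 7 must be set to make it even.
        return (byte)((ones % 2 == 1) ? (code | 0x80) : code);
    }
}
```

For example, 'A' (65, binary 1000001) already has an even number of 1 bits and is sent unchanged, while 'C' (67, binary 1000011) gets its top bit set.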
2. Structural features of each encoding
ASCII later proved insufficient for the languages of other countries, so a variety of encoding formats gradually emerged.
UTF-8: a variable-length encoding that uses one, two, three, or four bytes per character. Values below 128 (0x0080) are encoded in one byte (the ASCII range), and values from 0x0080 to 0x07FF in two bytes (European and Middle Eastern scripts). Values above 0x07FF are encoded in three bytes (including East Asian characters), and finally, characters that need a surrogate pair in UTF-16 are encoded in four bytes.
UTF-16: encodes each 16-bit character as two bytes. No compression step is involved, so performance is good. In .NET it is exposed as Encoding.Unicode, so it is also called the Unicode encoding.
UTF-32: encodes every character in four bytes. It can represent any character directly, but it takes more space and processing performance is lower.
UTF-7: obsolete; it is no longer recommended by the Unicode Consortium.
ASCII: converts 16-bit characters to ASCII characters. 16-bit characters with values below 128 are stored in a single byte, so it is efficient. Characters with values of 128 or above cannot be converted, and their values are lost.
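The byte counts described in the list above can be checked with a short C# sketch. The class EncodingSizes and its method names are hypothetical, introduced only for this illustration; the calls themselves (Encoding.UTF8, Encoding.Unicode, Encoding.UTF32, Encoding.ASCII, GetByteCount, GetBytes, GetString) are standard System.Text members.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Number of bytes each encoding needs for the given text.
    public static int Utf8Bytes(string text) => Encoding.UTF8.GetByteCount(text);
    public static int Utf16Bytes(string text) => Encoding.Unicode.GetByteCount(text);
    public static int Utf32Bytes(string text) => Encoding.UTF32.GetByteCount(text);

    // Round-trips text through ASCII; characters outside ASCII are lost
    // (the default ASCII encoder substitutes '?' for them).
    public static string ThroughAscii(string text) =>
        Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));

    public static void PrintSizes()
    {
        // One sample per UTF-8 length: U+0041, U+00E9, U+554A, and U+1F600,
        // the last of which is stored as a surrogate pair in a .NET string.
        string[] samples = { "A", "\u00E9", "\u554A", "\U0001F600" };
        foreach (string s in samples)
        {
            Console.WriteLine(
                $"U+{char.ConvertToUtf32(s, 0):X4}: UTF-8={Utf8Bytes(s)}, UTF-16={Utf16Bytes(s)}, UTF-32={Utf32Bytes(s)}");
        }
    }
}
```

Calling EncodingSizes.PrintSizes() prints one line per sample character, and EncodingSizes.ThroughAscii("A\u554A") shows the loss described for ASCII: the Chinese character does not survive the round trip.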
3. C# encoding and decoding examples
Reference: http://blog.csdn.net/xyjnzy/article/details/5072057
// 1. Obtain the location code of Chinese Characters
Byte [] array = new byte [2];
Array = System. Text. Encoding. Default. GetBytes ("ah ");
Int i1 = (short) (array [0]-''/0 '');
Int i2 = (short) (array [1]-''/0 '');
// 2. Chinese character codes in unicode decoding mode
Array = System. Text. Encoding. Unicode. GetBytes ("ah ");
I1 = (short) (array [0]-''/0 '');
I2 = (short) (array [1]-''/0 '');
// 3. unicode deserialization for Chinese Characters
String str = "4a55 ";
String s1 = str. Substring (0, 2 );
String s2 = str. Substring (2, 2 );
Int t1 = Convert. ToInt32 (s1, 16 );
Int t2 = Convert. ToInt32 (s2, 16 );
Array [0] = (byte) t1;
Array [1] = (byte) t2;
String s = System. Text. Encoding. Unicode. GetString (array );
// 4. undecodes Chinese Characters in default mode
Array [0] = (byte) 196;
Array [1] = (byte) 207;
S = System. Text. Encoding. Default. GetString (array );
// 5. Obtain the string length
S = "iam square gun ";
Int len = s. Length; // will output as 6
Byte [] sarr = System. Text. Encoding. Default. GetBytes (s );
Len = sarr. Length; // will output as 3 + 3*2 = 9
// 6. Add strings
System. Text. StringBuilder sb = new System. Text. StringBuilder ("");
Sb. Append ("I ");
Sb. Append ("am ");
Sb. Append ("square gun ");
// string --> byte array
string text = "65"; // sample input
byte[] data = System.Text.Encoding.ASCII.GetBytes(text);
// string --> byte (Convert.ToByte parses the string as a number from 0 to 255)
byte number = Convert.ToByte(text);
// byte[] --> string
string decoded = System.Text.Encoding.ASCII.GetString(data, 0, data.Length);
4. Using the Encoding class
The Encoding class provides many static properties, such as Unicode, UTF32, UTF7, ASCII, and Default, each of which returns an object that handles the corresponding character encoding. Note that the Default property should be used with caution: it uses the default character encoding of the computer the program runs on, so the behavior of your program will depend on that machine.
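One way to avoid depending on the machine's default is to name the encoding explicitly on both the encoding and the decoding side. The following is a minimal sketch of that idea; the class ExplicitEncodingDemo and its method names are hypothetical, and UTF-8 is chosen here only as an example. As far as I know, Encoding.Default is always UTF-8 on .NET Core and later, while on .NET Framework it follows the system code page, so its output really does vary.

```csharp
using System;
using System.Text;

public static class ExplicitEncodingDemo
{
    // Encodes with a named encoding, so the bytes do not depend on the machine.
    public static byte[] Encode(string text) => Encoding.UTF8.GetBytes(text);

    // Decodes with the same encoding that produced the bytes.
    public static string Decode(byte[] data) => Encoding.UTF8.GetString(data);

    // Decoding with a different encoding than the one used to encode
    // generally does not recover the original text.
    public static string DecodeWithWrongEncoding(byte[] data) =>
        Encoding.Unicode.GetString(data);

    public static void Show()
    {
        byte[] data = Encode("\u554A");
        Console.WriteLine(BitConverter.ToString(data));   // E5-95-8A
        Console.WriteLine(Decode(data) == "\u554A");      // True
        Console.WriteLine(Encoding.Default.WebName);      // varies by runtime and machine
    }
}
```

The design choice is simply that the writer and the reader of the bytes agree on one encoding in code, instead of each trusting its own machine's settings.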