.NET handling of character encoding problems

Source: Internet
Author: User

 

1. Character encoding history

For the history of character encoding, we quote an article by a fellow cnblogs member: http://www.cnblogs.com/KevinYang/archive/2010/06/18/1760597.html

Computers were first invented to solve numerical computing problems; later, people found that they could do more, such as text processing. However, because a computer only recognizes numbers, people must tell it which number represents which character. For example, 65 represents the letter 'A', 66 represents the letter 'B', and so on. This correspondence between characters and numbers must be consistent across computers; otherwise the same number would be displayed as different characters on different machines. Therefore, the American National Standards Institute (ANSI) set a standard that specifies a set of common characters and the number corresponding to each character. This is the ASCII character set, also known as ASCII code. At that time, computers generally used the 8-bit byte as the smallest unit of storage and processing, and very few characters were needed: the 26 English letters in upper and lower case, the digits, and other commonly used symbols. Because there are fewer than 100 printable characters (ASCII defines 128 codes in total), 7 bits are enough to store and process ASCII efficiently. The remaining bit was used as a parity bit in some communication systems.

 

 

2. Structural features of each encoding

ASCII later proved insufficient for the languages of other countries, so a variety of encoding formats gradually emerged.

UTF-8: a variable-length encoding that uses one, two, three, or four bytes per character. Values below 128 (0x0080) are encoded in one byte (ASCII), values in the range 0x0080-0x07FF in two bytes (European and Middle Eastern scripts), and values from 0x0800 to 0xFFFF in three bytes (including East Asian scripts). Finally, characters that UTF-16 represents as surrogate pairs (0x10000 and above) are encoded in four bytes.
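The byte ranges above can be checked directly. The following is a minimal sketch that assumes a C# 9 or later compiler (top-level statements); Utf8ByteCount is a hypothetical helper name introduced only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper: number of bytes UTF-8 needs for the given text.
static int Utf8ByteCount(string value) => Encoding.UTF8.GetByteCount(value);

Console.WriteLine(Utf8ByteCount("A"));           // U+0041, below 0x0080        -> 1
Console.WriteLine(Utf8ByteCount("\u00E9"));      // U+00E9, in 0x0080-0x07FF    -> 2
Console.WriteLine(Utf8ByteCount("\u554A"));      // U+554A, in 0x0800-0xFFFF    -> 3
Console.WriteLine(Utf8ByteCount("\U0001F600"));  // U+1F600, surrogate pair     -> 4
```

GetByteCount only counts; GetBytes returns the actual byte sequence when the bytes themselves are needed.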

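UTF-16: encodes each 16-bit character in two bytes. Because .NET strings are already stored as UTF-16, no conversion is needed, so performance is good. Characters outside the 16-bit range are stored as a surrogate pair and take four bytes. In .NET this encoding is exposed as Encoding.Unicode, which is why it is also called the Unicode encoding.

UTF-32: uses four bytes to encode every character. It can represent any character at a fixed width, but it wastes space and processing performance is low.

UTF-7: obsolete. It is not part of the Unicode Standard, and Encoding.UTF7 is marked obsolete in .NET 5 and later.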
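To compare UTF-16 and UTF-32 on the same input, here is a small sketch assuming C# 9 or later; ByteCounts is a hypothetical helper, not part of the .NET API. It shows that a character outside the 16-bit range occupies two char values in a .NET string.

```csharp
using System;
using System.Text;

// Hypothetical helper: byte counts of the same text in UTF-16 and UTF-32.
static (int Utf16, int Utf32) ByteCounts(string value) =>
    (Encoding.Unicode.GetByteCount(value), Encoding.UTF32.GetByteCount(value));

string bmp = "\u554A";               // fits in one 16-bit code unit
string supplementary = "\U0001F600"; // needs a surrogate pair in UTF-16

Console.WriteLine(bmp.Length);                              // 1
Console.WriteLine(supplementary.Length);                    // 2 (two char values)
Console.WriteLine(char.IsSurrogatePair(supplementary, 0));  // True
Console.WriteLine(ByteCounts(bmp));                         // (2, 4)
Console.WriteLine(ByteCounts(supplementary));               // (4, 4)
```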

ASCII: encodes 16-bit characters as ASCII characters. Characters with values below 128 are stored in a single byte, so efficiency is good. Characters with values of 128 (0x0080) or above cannot be converted; their values are lost (by default .NET replaces each of them with '?').
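The loss can be seen with a round trip through Encoding.ASCII. This is a minimal sketch assuming C# 9 or later and the default replacement behavior of Encoding.ASCII; AsciiRoundTrip is a hypothetical helper name.

```csharp
using System;
using System.Text;

// Hypothetical helper: encode to ASCII bytes, then decode back to a string.
static string AsciiRoundTrip(string value) =>
    Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(value));

Console.WriteLine(AsciiRoundTrip("iam"));        // iam  (all values below 128)
Console.WriteLine(AsciiRoundTrip("caf\u00E9"));  // caf? (U+00E9 is lost)
Console.WriteLine(AsciiRoundTrip("\u554A"));     // ?    (U+554A is lost)
```

When silent replacement is not acceptable, an encoding that can represent the text, such as UTF-8, avoids the loss.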

 

 

3. C# encoding and decoding example

Reference: http://blog.csdn.net/xyjnzy/article/details/5072057
// 1. Obtain the location code of Chinese Characters

Byte [] array = new byte [2];
Array = System. Text. Encoding. Default. GetBytes ("ah ");

Int i1 = (short) (array [0]-''/0 '');
Int i2 = (short) (array [1]-''/0 '');

// 2. Chinese character codes in unicode decoding mode
Array = System. Text. Encoding. Unicode. GetBytes ("ah ");
I1 = (short) (array [0]-''/0 '');
I2 = (short) (array [1]-''/0 '');

// 3. unicode deserialization for Chinese Characters
String str = "4a55 ";
String s1 = str. Substring (0, 2 );
String s2 = str. Substring (2, 2 );

Int t1 = Convert. ToInt32 (s1, 16 );
Int t2 = Convert. ToInt32 (s2, 16 );

Array [0] = (byte) t1;
Array [1] = (byte) t2;

String s = System. Text. Encoding. Unicode. GetString (array );

// 4. undecodes Chinese Characters in default mode
Array [0] = (byte) 196;
Array [1] = (byte) 207;
S = System. Text. Encoding. Default. GetString (array );

// 5. Obtain the string length
S = "iam square gun ";
Int len = s. Length; // will output as 6
Byte [] sarr = System. Text. Encoding. Default. GetBytes (s );
Len = sarr. Length; // will output as 3 + 3*2 = 9

// 6. Add strings
System. Text. StringBuilder sb = new System. Text. StringBuilder ("");
Sb. Append ("I ");
Sb. Append ("am ");
Sb. Append ("square gun ");

String --> byte array

byte[] data = System.Text.Encoding.ASCII.GetBytes(str);

String --> byte

byte data = Convert.ToByte(str);

Byte[] --> string

str = System.Text.Encoding.ASCII.GetString(bytes, 0, nBytesSize);
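The two "string to byte" conversions above do different things: Convert.ToByte parses the text as a number, while Encoding.GetBytes encodes each character. The following sketch, assuming C# 9 or later, uses the sample input "65" (chosen only for illustration) to show the difference.

```csharp
using System;
using System.Text;

// Convert.ToByte parses the text as a number in the range 0-255.
byte parsed = Convert.ToByte("65");

// Encoding.GetBytes encodes each character: '6' is 54 and '5' is 53 in ASCII.
byte[] encoded = Encoding.ASCII.GetBytes("65");

// Decoding the byte array gives back the original text.
string decoded = Encoding.ASCII.GetString(encoded, 0, encoded.Length);

Console.WriteLine(parsed);                     // 65
Console.WriteLine(string.Join(",", encoded));  // 54,53
Console.WriteLine(decoded);                    // 65
```

Convert.ToByte throws a FormatException for text that is not a number, such as "A", so it is not a substitute for an encoding.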

 

 

 

4. Use of the Encoding class

The Encoding class provides many static properties, such as Unicode, UTF32, UTF7, ASCII, and Default. Each returns an object that handles the corresponding character encoding. Note that the Default property should be used with caution: a program that relies on it is affected by the computer it runs on, because Default uses that computer's default character encoding. (On .NET Core and .NET 5 and later, Default always returns UTF-8.)
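One way to avoid the machine dependence is to name the encoding explicitly on both the encoding and the decoding side. This is a minimal sketch assuming C# 9 or later; RoundTrip and the sample text are hypothetical and used only for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper: encode and decode with the same explicit encoding.
static string RoundTrip(string value, Encoding encoding) =>
    encoding.GetString(encoding.GetBytes(value));

string sample = "iam\u554A";

// An explicit encoding produces the same bytes on every machine.
byte[] utf8 = Encoding.UTF8.GetBytes(sample);
Console.WriteLine(BitConverter.ToString(utf8));              // 69-61-6D-E5-95-8A
Console.WriteLine(RoundTrip(sample, Encoding.UTF8) == sample); // True

// Encoding.Default depends on the runtime and the machine, so this line
// may print a different name on different systems.
Console.WriteLine(Encoding.Default.WebName);
```

Data written with Encoding.Default on one machine may not decode correctly on another, which is the reason for the caution above.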

 

 
