Character Sets and Encoding

Source: Internet
Author: User

Different character set encodings use different amounts of storage space, and choosing the wrong one can produce garbled characters. In particular, the sender and the receiver of data must use the same encoding.

ASCII is the earliest and most basic encoding. It uses 7 bits per character, for a total of 2^7 = 128 characters. Latin1 (ISO-8859-1) later extended ASCII to 8 bits, so each character occupies one byte and 2^8 = 256 characters can be represented. That covers more special characters than ASCII, but it is still not enough for the scripts of some regions, such as Chinese. Two kinds of encoding address this problem: Unicode, which covers the characters of all regions, and region-specific encodings, such as GB2312 for Chinese.
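
As a rough illustration of these limits, the sketch below encodes one character from each range with ASCII and with Latin1. It assumes the built-in Encoding.ASCII and Encoding.GetEncoding("iso-8859-1") with their default behavior of substituting a replacement for characters they cannot represent. The class name RangeDemo is only for illustration.

```csharp
using System;
using System.Text;

class RangeDemo
{
    static void Main()
    {
        Encoding ascii = Encoding.ASCII;
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // 'A' (U+0041) fits in 7 bits, so both encodings store it as 65.
        Console.WriteLine(ascii.GetBytes("A")[0]);       // 65
        Console.WriteLine(latin1.GetBytes("A")[0]);      // 65

        // U+00E9 (e with acute accent) is above 127: ASCII falls back to '?',
        // while Latin1 stores it as the single byte 233.
        Console.WriteLine(ascii.GetBytes("\u00E9")[0]);  // 63
        Console.WriteLine(latin1.GetBytes("\u00E9")[0]); // 233

        // Chinese characters are outside Latin1, so a round trip loses them.
        string chinese = "\u4FE1\u606F";
        Console.WriteLine(latin1.GetString(latin1.GetBytes(chinese)) == chinese); // False
    }
}
```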

The Unicode encoding discussed here is UTF-16, which is what Encoding.Unicode in .NET produces. It uses two bytes for most characters, which gives 2^16 = 65,536 two-byte values; the less common characters beyond that range take four bytes. When a document is mostly or entirely English, this wastes space. UTF-8 solves the problem: it encodes English (ASCII) characters in one byte, exactly as ASCII does, while a Chinese character typically takes three bytes. GB2312 uses two bytes per Chinese character.
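
The size trade-off can be checked with Encoding.GetByteCount. This is a minimal sketch and the class name SizeDemo is only for illustration. GB2312 is left out here because, on newer .NET versions, it needs an extra provider to be registered first.

```csharp
using System;
using System.Text;

class SizeDemo
{
    static void Main()
    {
        string english = "message";
        string chinese = "\u4FE1\u606F"; // the two Chinese characters from the example string

        // English text: 1 byte per character in UTF-8, 2 in UTF-16.
        Console.WriteLine(Encoding.UTF8.GetByteCount(english));    // 7
        Console.WriteLine(Encoding.Unicode.GetByteCount(english)); // 14

        // These Chinese characters: 3 bytes each in UTF-8, 2 in UTF-16.
        Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));    // 6
        Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 4
    }
}
```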

The Encoding class in .NET, located in the System.Text namespace, is the core class for all of the encodings. It provides conversion between byte arrays and characters, as well as conversion between different encodings. The Encoding class is declared as follows:

  public abstract class Encoding : ICloneable
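
For conversion between encodings, one option is the static Encoding.Convert method. The sketch below re-encodes the UTF-8 bytes of the two Chinese characters as UTF-16. The class name ConvertDemo is only for illustration.

```csharp
using System;
using System.Text;

class ConvertDemo
{
    static void Main()
    {
        // UTF-8 bytes for the two Chinese characters in the example string.
        byte[] utf8Bytes = { 228, 191, 161, 230, 129, 175 };

        // Re-encode the same text from UTF-8 to UTF-16 (little-endian) in one call.
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

        Console.WriteLine(string.Join(",", utf16Bytes)); // 225,79,111,96
    }
}
```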

The classes derived from Encoding include ASCIIEncoding, UnicodeEncoding, and UTF8Encoding, each of which overrides the base class members for its own encoding.
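
A short sketch of how the derived classes relate to the static properties on Encoding, and of the options they expose when constructed directly. The class name DerivedDemo is only for illustration.

```csharp
using System;
using System.Text;

class DerivedDemo
{
    static void Main()
    {
        // The static properties return instances of the derived classes.
        Console.WriteLine(Encoding.UTF8 is UTF8Encoding);       // True
        Console.WriteLine(Encoding.Unicode is UnicodeEncoding); // True

        // Constructing a derived class directly exposes its options,
        // here big-endian UTF-16 without a byte order mark.
        Encoding utf16BigEndian = new UnicodeEncoding(bigEndian: true, byteOrderMark: false);
        byte[] bytes = utf16BigEndian.GetBytes("\u4FE1"); // the Chinese character U+4FE1
        Console.WriteLine(string.Join(",", bytes));       // 79,225
    }
}
```

The big-endian result is the byte-swapped form of the 225,79 that Encoding.Unicode (little-endian) produces for the same character.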

The following uses the string "message,信息" (English letters, a half-width comma, and two Chinese characters) as an example to show how each encoding represents it.

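  using System.Text;

  string result = "";
  string s = "message,信息";
  // Encode the string as UTF-8; swap in one of the commented lines to compare.
  byte[] b = Encoding.UTF8.GetBytes(s);
  // byte[] b = Encoding.Unicode.GetBytes(s);
  // byte[] b = Encoding.GetEncoding("gb2312").GetBytes(s);
  foreach (byte i in b)
  {
      // Append each byte as a decimal number followed by a comma.
      result += i.ToString() + ",";
  }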

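With UTF-8, the value of result is "109,101,115,115,97,103,101,44,228,191,161,230,129,175," (the loop also appends a comma after the last byte). The byte values for each character in the three encodings are: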

  Encoding  m      e      s      s      a     g      e      ,     信           息
  UTF-8     109    101    115    115    97    103    101    44    228,191,161  230,129,175
  Unicode   109,0  101,0  115,0  115,0  97,0  103,0  101,0  44,0  225,79       111,96
  GB2312    109    101    115    115    97    103    101    44    208,197      207,162
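
The table can be reproduced with a loop over the three encodings. This is a sketch under one assumption that is not in the original text: on .NET Core and .NET 5 or later, code-page encodings such as gb2312 are only available after registering CodePagesEncodingProvider, whereas on .NET Framework they are built in and the registration line can be dropped. The class name TableDemo is only for illustration.

```csharp
using System;
using System.Text;

class TableDemo
{
    static void Main()
    {
        // Assumption: needed on .NET Core / .NET 5+ so that
        // GetEncoding("gb2312") does not throw; unnecessary on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string s = "message,\u4FE1\u606F"; // "message," followed by the two Chinese characters
        Encoding[] encodings =
        {
            Encoding.UTF8,
            Encoding.Unicode,
            Encoding.GetEncoding("gb2312"),
        };

        // Print one line per encoding: its name, then the comma-separated bytes.
        foreach (Encoding encoding in encodings)
        {
            byte[] bytes = encoding.GetBytes(s);
            Console.WriteLine(encoding.WebName + ": " + string.Join(",", bytes));
        }
    }
}
```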

Conversion from byte array to character

  byte[] b = new byte[] { 109, 101, 115, 115, 97, 103, 101, 44, 228, 191, 161, 230, 129, 175 };
  string s = Encoding.UTF8.GetString(b);

The value of s is "message,信息". The byte array is UTF-8 encoded, so decoding it with the gb2312 encoding's GetString produces garbled Chinese characters: "message,淇℃伅".
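
A mismatch in the opposite direction, GB2312 bytes read as UTF-8, can be detected rather than silently garbled. The sketch below assumes a UTF8Encoding constructed with throwOnInvalidBytes set to true, and uses the GB2312 byte values from the table. The class name StrictDecodeDemo is only for illustration.

```csharp
using System;
using System.Text;

class StrictDecodeDemo
{
    static void Main()
    {
        // GB2312 bytes for the two Chinese characters, taken from the table.
        byte[] gb2312Bytes = { 208, 197, 207, 162 };

        // The default Encoding.UTF8 silently substitutes U+FFFD for invalid input.
        string lenient = Encoding.UTF8.GetString(gb2312Bytes);
        Console.WriteLine(lenient.Contains("\uFFFD")); // True

        // A UTF8Encoding created with throwOnInvalidBytes: true raises instead.
        Encoding strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(gb2312Bytes);
            Console.WriteLine("decoded without error");
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("bytes are not valid UTF-8");
        }
    }
}
```

This check only works in this direction: most UTF-8 byte sequences are also acceptable input to a gb2312 decoder, so the garbling in the example goes unreported.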
