A string consists of one or more characters. Each character is represented by one or more bytes, and each byte consists of eight bits.
In C#, a string is declared with string, a character with char, and a byte with byte. C# has no built-in type for a single bit; bits are usually inspected through bitwise operators or by formatting a byte in base 2, as the test code below does. For a detailed analysis, see the following test code:
Complete Test code:
```csharp
using System;
using System.IO;
using System.Text;

namespace csharprumenjd
{
    class Program
    {
        static void Main(string[] args)
        {
            // Note: on .NET Core / .NET 5+, "gb2312" is only available after calling
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            string unicodeStr = "啊?/123";
            Console.WriteLine("String: " + unicodeStr);
            Console.WriteLine("Length: " + unicodeStr.Length);
            Console.WriteLine("Unicode byte length: " + Encoding.Unicode.GetByteCount(unicodeStr));
            var unicodeBytes = Encoding.Unicode.GetBytes(unicodeStr);
            Console.WriteLine("gb2312 byte length: " + Encoding.GetEncoding("gb2312").GetByteCount(unicodeStr));
            var gb2312Bytes = Encoding.GetEncoding("gb2312").GetBytes(unicodeStr);

            #region Garbled-text test
            // Decoding gb2312 bytes as Unicode (UTF-16) yields garbled text.
            var gb2312ToUnicodeStr = Encoding.Unicode.GetString(gb2312Bytes);
            Console.WriteLine("gb2312 bytes decoded as Unicode: " + gb2312ToUnicodeStr);
            // Decoding with the matching encoding restores the original string.
            var gb2312Str = Encoding.GetEncoding("gb2312").GetString(gb2312Bytes);
            Console.WriteLine("gb2312 bytes decoded as gb2312: " + gb2312Str);
            #endregion

            #region Print binary data
            // Print each byte as "decimal:binary", separated by '|'.
            int capacity = gb2312Bytes.Length * 8;
            StringBuilder sb = new StringBuilder(capacity);
            for (int i = 0; i < gb2312Bytes.Length; i++)
            {
                sb.Append(gb2312Bytes[i] + ":" + Convert.ToString(gb2312Bytes[i], 2).PadLeft(8, '0') + "|");
            }
            Console.WriteLine(sb.ToString().TrimEnd('|'));
            #endregion

            // Write the same string to two files, one per encoding.
            StreamWriter sw = new StreamWriter("1.txt", false, Encoding.Unicode);
            sw.Write(unicodeStr);
            sw.Close();
            StreamWriter sw1 = new StreamWriter("2.txt", false, Encoding.GetEncoding("gb2312"));
            sw1.Write(unicodeStr);
            sw1.Close();
            Console.ReadKey();
        }
    }
}
```
Test results:
The test results show that, for the same string, Unicode encoding yields 12 bytes while gb2312 encoding yields 7 bytes.
In addition, decoding the gb2312-encoded byte array as Unicode produces garbled text, whereas decoding it with gb2312 returns the original string without any problem.
Question 1: Why do the two encodings produce different byte counts?
Unicode: Unicode is an international standard. Encoding.Unicode in .NET is UTF-16 (little-endian), which stores every character of the Basic Multilingual Plane in two bytes, whether it is a digit, a letter, or a Chinese character, so for such text the byte length is twice the character count (characters outside that plane take four bytes).
gb2312: gb2312 is one of the ANSI encodings, which belong to the second phase of multi-language support described below. Each character is represented by one or more bytes (MBCS), so characters stored this way are also called multi-byte characters. For example, "啊?/123" is 7 bytes long: each Chinese character occupies 2 bytes, and each English letter, digit, or ASCII symbol occupies 1 byte.
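The per-character difference can be made visible with a small sketch. The class and method names (ByteCountDemo, BytesPerChar) are made up for illustration, the helper assumes the text contains only Basic Multilingual Plane characters, and the provider registration is an assumption that applies to .NET Core / .NET 5+, where gb2312 is not available by default.

```csharp
using System;
using System.Text;

static class ByteCountDemo
{
    // Hypothetical helper: returns how many bytes each character of the text
    // occupies in the given encoding (assumes no surrogate pairs).
    public static int[] BytesPerChar(string text, Encoding encoding)
    {
        var counts = new int[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            counts[i] = encoding.GetByteCount(text[i].ToString());
        }
        return counts;
    }

    static void Main()
    {
        // Assumption: needed on .NET Core / .NET 5+ so that "gb2312" can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string text = "啊?/123";
        // UTF-16: every character here takes 2 bytes, 12 in total.
        Console.WriteLine(string.Join(",", BytesPerChar(text, Encoding.Unicode)));
        // gb2312: the Chinese character takes 2 bytes, the ASCII ones 1 each, 7 in total.
        Console.WriteLine(string.Join(",", BytesPerChar(text, Encoding.GetEncoding("gb2312"))));
    }
}
```

The two printed lines should sum to 12 and 7, the same totals the test program reports.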
Development of character and encoding
From the perspective of computer support for multiple languages, there are roughly three phases:
| Phase | System internal code | Description | Systems |
| --- | --- | --- | --- |
| Phase 1 | ASCII | At first, computers supported only English; other languages could not be stored or displayed. | English DOS |
| Phase 2 | ANSI encoding (localization) | To let computers support more languages, two bytes in the range 0x80-0xFF are typically used to represent one character. For example, in a Chinese operating system the character "中" is stored as the two bytes [0xD6, 0xD0]. Different countries and regions developed their own standards, such as gb2312, big5, and JIS. These schemes, which use two bytes per character as an extension of ASCII, are called ANSI encodings. On a Simplified Chinese system, ANSI encoding means gb2312; on a Japanese system, it means JIS. Different ANSI encodings are incompatible with one another, so when information is exchanged internationally, text in two languages cannot be stored in the same ANSI-encoded text. | Chinese DOS, Chinese Windows 95/98, Japanese Windows 95/98 |
| Phase 3 | Unicode (internationalization) | To facilitate international information exchange, international organizations developed the Unicode character set, which assigns a uniform and unique number to every character in every language, meeting the needs of cross-language and cross-platform text conversion and processing. | Windows NT/2000/XP, Linux, Java |
Question 2: What do the decimal numbers in the last line of the output represent?
Because the byte array is encoded in gb2312, we first need to understand how gb2312 works. In gb2312, every Chinese character and symbol is represented by two bytes. The first byte is called the "high byte" (also the "zone byte") and the second is called the "low byte" (also the "position byte"). The high byte uses 0xA1-0xF7 (zone numbers 01-87 plus 0xA0) and the low byte uses 0xA1-0xFE (position numbers 01-94 plus 0xA0). 0xA0 is 160 in decimal. "啊" is the first Chinese character in the gb2312 character set: its zone number is 16 and its position number is 01, so its zone-position code is 1601.
Therefore its high byte is 0xA0 + 16 = 160 + 16 = 176 and its low byte is 0xA0 + 1 = 160 + 1 = 161, which matches the first two numbers printed. The remaining five decimal numbers are the ASCII codes of the five characters that follow "啊", as a lookup in an ASCII table confirms.
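The arithmetic above can be sketched in a few lines. Gb2312ZoneDemo and FromZonePosition are hypothetical names introduced only for this illustration; the ranges checked are the ones quoted in the text (zones 01-87, positions 01-94).

```csharp
using System;

static class Gb2312ZoneDemo
{
    // Hypothetical helper: converts a four-digit zone-position code such as 1601
    // into the two gb2312 bytes by adding 0xA0 to the zone and to the position.
    public static byte[] FromZonePosition(int code)
    {
        int zone = code / 100;      // first two digits, e.g. 16
        int position = code % 100;  // last two digits, e.g. 01
        if (zone < 1 || zone > 87 || position < 1 || position > 94)
        {
            throw new ArgumentOutOfRangeException(nameof(code));
        }
        return new[] { (byte)(0xA0 + zone), (byte)(0xA0 + position) };
    }

    static void Main()
    {
        byte[] bytes = FromZonePosition(1601);
        Console.WriteLine(bytes[0] + " " + bytes[1]); // 176 161

        // The remaining characters are plain ASCII, so each byte is just the character code.
        foreach (char c in "?/123")
        {
            Console.Write((int)c + " "); // 63 47 49 50 51
        }
        Console.WriteLine();
    }
}
```

Together the two lines give 176, 161, 63, 47, 49, 50, 51, which is what gb2312 encoding would be expected to yield for the seven bytes of the test string.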
Question 3: Why does the size of the generated text file differ from the byte count printed on the console?
The file written with gb2312 encoding is 7 bytes, which matches the count printed on the console, whereas the file written with Unicode encoding is 14 bytes, two bytes more than the count printed on the console. I do not know how to explain this.
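One likely explanation, offered as a sketch rather than something the original test verified, is the byte-order mark (BOM): Encoding.Unicode reports a two-byte preamble (FF FE), and StreamWriter writes that preamble at the start of a new file before the text. GetByteCount and GetBytes do not include it, which would account for 12 + 2 = 14. The names BomDemo and WriteAndRead, and the use of a temporary file, are assumptions made for the illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: writes the text with StreamWriter in the given encoding
    // and returns the raw bytes that ended up in the file.
    public static byte[] WriteAndRead(string path, string text, Encoding encoding)
    {
        using (var writer = new StreamWriter(path, false, encoding))
        {
            writer.Write(text);
        }
        return File.ReadAllBytes(path);
    }

    static void Main()
    {
        // The preamble (BOM) that Encoding.Unicode asks writers to emit.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble())); // FF-FE

        string path = Path.GetTempFileName();
        byte[] fileBytes = WriteAndRead(path, "啊?/123", Encoding.Unicode);
        Console.WriteLine(fileBytes.Length);                               // 14
        Console.WriteLine(Encoding.Unicode.GetByteCount("啊?/123"));       // 12
        File.Delete(path);
    }
}
```

If that is the cause, constructing the encoding as new UnicodeEncoding(false, false), which has an empty preamble, should produce a 12-byte file. gb2312 has no preamble, which is consistent with the 7-byte file matching the console output.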