We can make an experiment, use NotePad to save the Chinese and English character strings of "China AB" in different encoding methods into multiple ". txt" files, and then directly view their binary content:
Figure 1 Comparison of character encoding
Figure 1 shows the different binary data obtained by "China AB" in four encoding methods (ANSI, UTF8, Unicode, and Unicode Big Endian.
Take the English character "a" as an example. ANSI and UTF-8 both encode it as the single byte "61", but the two Unicode options extend it to 2 bytes (16 bits), giving "61 00" and "00 61". This 16-bit encoding is called UTF-16.
UTF-16 comes in two variants, big endian and little endian. The only difference between them is the byte order: little endian encodes "a" as "61 00", while big endian encodes it as "00 61". Notepad's plain "Unicode" option is the little endian variant.
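The byte-order difference can be reproduced in a few lines. The sketch below uses Python rather than .NET purely for brevity, and the variable names are illustrative only:

```python
# Encode the same character with both UTF-16 byte orders.
# The "-le" / "-be" codecs write no BOM, so only the character's bytes appear.
text = "a"

little = text.encode("utf-16-le")  # low-order byte first
big = text.encode("utf-16-be")     # high-order byte first

print(little.hex(" "))  # 61 00
print(big.hex(" "))     # 00 61
```

For a single 16-bit code unit, one result is simply the other with its two bytes swapped.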
Now look at the Chinese characters. The word "中国" ("China") consists of two characters. Its ANSI encoding is "D6 D0 B9 FA", 4 bytes in total, so each Chinese character occupies 2 bytes. (Here "ANSI" means the system code page, which is GBK on Simplified Chinese Windows.) Its UTF-8 encoding is "E4 B8 AD E5 9B BD", 6 bytes in total, so each Chinese character occupies 3 bytes. Since "a" needed only 1 byte, this shows that UTF-8 is a variable-length encoding: it uses 1 to 4 bytes per character.
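The byte counts above can be checked with a short Python sketch. It assumes that Notepad's "ANSI" on a Simplified Chinese system corresponds to Python's "gbk" codec; `hex_bytes` is a hypothetical helper introduced only for this illustration.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as upper-case, space-separated hex."""
    return text.encode(encoding).hex(" ").upper()

word = "中国"

# Assumption: "gbk" stands in for the ANSI code page of Simplified Chinese Windows.
print(hex_bytes(word, "gbk"))    # D6 D0 B9 FA
print(hex_bytes(word, "utf-8"))  # E4 B8 AD E5 9B BD

# UTF-8 is variable-length: the number of bytes depends on the character.
for ch in ["a", "é", "中", "😀"]:
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3 and 4 bytes respectively
```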
In addition, the UTF-8 and Unicode files (both big endian and little endian) begin with a few marker bytes. These bytes, placed at the very start of a text file, are known as the BOM (Byte Order Mark) and indicate the encoding of the text. The BOM values of the common character encodings in .NET programs are:
Encoding | BOM Value
UTF-8 | EF BB BF
UTF-16 big endian | FE FF
UTF-16 little endian | FF FE
UTF-32 big endian | 00 00 FE FF
UTF-32 little endian | FF FE 00 00
With this background, we can automatically detect the encoding of a text from its BOM value and then correctly decode the string from the binary data stream.
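A minimal sketch of BOM-based detection follows, written in Python rather than .NET for brevity. The names `detect_bom` and `decode_with_bom` are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption of this sketch, not a rule from the text. The UTF-32 BOMs are tested first because the UTF-32 little endian BOM (FF FE 00 00) begins with the UTF-16 little endian BOM (FF FE).

```python
import codecs

# Longer BOMs come first so that FF FE 00 00 is not mistaken for FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def detect_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding name, BOM length) for the leading bytes of `data`."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM found: fall back to the caller's default (an assumption).
    return default, 0

def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip the BOM, if any, and decode the remaining bytes."""
    name, bom_length = detect_bom(data, default)
    return data[bom_length:].decode(name)

# "中国a" saved as UTF-16 little endian with a BOM.
sample = bytes.fromhex("FF FE 2D 4E FD 56 61 00")
print(detect_bom(sample))       # ('utf-16-le', 2)
print(decode_with_bom(sample))  # 中国a
```

A BOM is only a hint. A file without one (such as an ANSI file) cannot be identified this way, and a UTF-16 little endian text whose first character is U+0000 would be misread as UTF-32, so a real program may need an explicit default or extra heuristics.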