Understanding character encoding

Last Update:2018-12-06 Source: Internet

Author: User


We can make an experiment, use NotePad to save the Chinese and English character strings of "China AB" in different encoding methods into multiple ". txt" files, and then directly view their binary content:

Figure 1 Comparison of character encoding

Figure 1 shows the different binary data produced for the same string under four encoding methods (ANSI, UTF-8, Unicode, and Unicode Big Endian), as Notepad names them.

Take the English character "a" as an example. ANSI and UTF-8 both encode it as the single byte "61", but the two Unicode options extend it to a 2-byte (16-bit) value, "61 00" or "00 61", so this encoding method is called UTF-16.

UTF-16 comes in two variants, Big Endian and Little Endian, which differ only in byte order: Little Endian encodes "a" as "61 00", while Big Endian encodes it as "00 61".
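The byte-order difference can be checked with a minimal Python sketch. The `hex_bytes` helper is a hypothetical name introduced here for illustration; the codec names `utf-16-le` and `utf-16-be` are Python's standard ones and emit no BOM.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

# Encode the single character "a" (U+0061) in both UTF-16 byte orders.
little = "a".encode("utf-16-le")
big = "a".encode("utf-16-be")

print(hex_bytes(little))  # 61 00
print(hex_bytes(big))     # 00 61

# For a single 16-bit unit, reversing the two bytes converts one order
# into the other.
swapped = little[::-1]
print(swapped == big)     # True
```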

Now let's look at the Chinese text. "中国" ("China") consists of two Chinese characters. Its ANSI encoding (the GBK code page on a Simplified Chinese system) is "D6 D0 B9 FA": four bytes, so each Chinese character occupies two bytes. Its UTF-8 encoding is "E4 B8 AD E5 9B BD": six bytes, so each Chinese character occupies three bytes. This shows that UTF-8 is a variable-length encoding, which uses 1 to 4 bytes to represent a character.
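The byte counts above can be reproduced with a short Python sketch. It assumes that "ANSI" on the machine used for the experiment corresponds to Python's `gbk` codec; the `hex_bytes` helper and the sample characters are illustrative choices, not part of the original experiment.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

# The two Chinese characters U+4E2D U+56FD, written as escapes so the
# example does not depend on the encoding of this source file.
text = "\u4e2d\u56fd"

gbk = text.encode("gbk")     # assumption: "ANSI" means GBK here
utf8 = text.encode("utf-8")

print(hex_bytes(gbk), len(gbk))    # D6 D0 B9 FA 4
print(hex_bytes(utf8), len(utf8))  # E4 B8 AD E5 9B BD 6

# UTF-8 is variable length: 1 to 4 bytes depending on the code point.
samples = ["a", "\u00e9", "\u4e2d", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```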

In addition, the UTF-8 and Unicode files (whether Big Endian or Little Endian) begin with a few marker bytes. These bytes, placed at the very start of a text file, are known as the BOM (Byte Order Mark) and indicate the encoding of the text. The BOM values of the common character encodings in .NET programs are:

Encoding	BOM Value
UTF-8	EF BB BF
UTF-16 big endian	FE FF
UTF-16 little endian	FF FE
UTF-32 big endian	00 00 FE FF
UTF-32 little endian	FF FE 00 00

With this background, we can automatically detect the encoding of a piece of text from its BOM, when one is present, and then correctly decode the string from the binary data stream.
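BOM-based detection can be sketched in Python as follows. This is a minimal illustration, not the article's .NET implementation: `detect_bom` and `decode_with_bom` are hypothetical names, and falling back to UTF-8 when no BOM is found is an assumption made for this example. The BOM constants come from Python's standard `codecs` module and match the table above.

```python
import codecs

# (BOM bytes, codec name). UTF-32 entries come first because the
# UTF-32 little endian BOM (FF FE 00 00) begins with the UTF-16
# little endian BOM (FF FE) and would otherwise be misdetected.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_bom(data: bytes):
    """Return (codec name, BOM length), or (None, 0) if no BOM matches."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Strip a leading BOM, if any, and decode with the matching codec.

    The fallback codec is an assumption for data that carries no BOM.
    """
    name, skip = detect_bom(data)
    return data[skip:].decode(name or fallback)

# Usage: a UTF-16 little endian stream for the two Chinese characters
# U+4E2D U+56FD followed by "ab", prefixed with its BOM.
sample = codecs.BOM_UTF16_LE + "\u4e2d\u56fdab".encode("utf-16-le")
print(detect_bom(sample))  # ('utf-16-le', 2)
print(decode_with_bom(sample) == "\u4e2d\u56fdab")  # True
```

A BOM is only a hint. Data without a BOM cannot be identified this way, and a UTF-16 little endian stream whose first character is U+0000 would be mistaken for UTF-32, so real detectors usually combine the BOM check with other heuristics.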

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
