Notes on character sets

Last Update:2014-08-14 Source: Internet

Author: User


As you know, a computer represents all data in binary, using only 0 and 1. Everything we read on screen, whether Chinese or English, is text made of characters, and every character is converted to binary when it is stored.

Which binary pattern stands for which character is a matter of human convention. Such a convention is called an encoding.

Over time many encodings have appeared, such as ASCII and Unicode, as well as GB2312 and GBK for Chinese characters.

ASCII

ASCII is the most familiar encoding. It was developed in the United States in the early 1960s and first published in 1963. It began as an American national standard and was later adopted by ISO as an international standard. Today virtually every machine supports the ASCII character set.

ASCII is essentially a mapping between English characters and binary codes. Each code is seven bits long, so ASCII can represent 128 characters in total. Codes 0 to 31 and code 127 are control characters, which cannot be printed. The remaining 95 are printable: the space, digits, uppercase and lowercase English letters, and punctuation marks.

However, because the basic unit of computer storage is the byte, that is, 8 bits, an ASCII code is usually stored in one byte with the most significant bit set to 0.
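As a quick check of the numbers, here is a minimal Python sketch. It counts the space (32) as printable and DEL (127) as a control code, and the sample string is only an illustration.

```python
# Partition the 7-bit ASCII range into control and printable codes.
control = [c for c in range(128) if c < 32 or c == 127]
printable = [c for c in range(128) if 32 <= c <= 126]

# Every ASCII character fits in one byte whose most significant bit is 0.
encoded = "Hello, ASCII!".encode("ascii")
top_bits = {b >> 7 for b in encoded}

print(len(control), len(printable))   # 33 95
print(format(ord("A"), "08b"))        # 01000001
print(top_bits)                       # {0}
```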

As the code ranges above show, ASCII supports only English. If the whole world used only English, things would be simple and no other encoding would be needed. Unfortunately, that is not the case.

The world has developed a great variety of languages, including the Chinese characters we read every day.

Every Chinese character, like every leaf, is unique. The ASCII table has only 128 entries, which is nowhere near enough to represent them.

Western European languages hit the same limit, so some of those countries decided to use the idle highest bit of the ASCII byte. With 8 bits they can represent 256 states, that is, an extended ASCII code with 256 characters.

For Western European languages based on the Latin alphabet, 256 characters are sufficient.

However, there is a problem. Codes 0 to 127 keep their ASCII meaning everywhere, but each country assigns codes 128 to 255 to the characters of its own language, which leads to the following problem:

The same encoding is displayed as different characters in different countries.
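This is easy to reproduce. The sketch below decodes one byte from the upper half under three single-byte code pages that ship with Python; the choice of 0xE9 and of these three code pages is just an illustration.

```python
raw = bytes([0xE9])  # one byte in the 128-255 range

as_western = raw.decode("latin-1")     # ISO 8859-1, Western European
as_cyrillic = raw.decode("cp1251")     # Windows Cyrillic code page
as_greek = raw.decode("iso8859_7")     # ISO 8859-7, Greek

# Same byte, three different characters.
print(as_western, as_cyrillic, as_greek)

# The lower half (0-127) decodes identically under all three.
lower_half = b"A"
same_everywhere = {lower_half.decode(enc) for enc in ("latin-1", "cp1251", "iso8859_7")}
print(same_everywhere)   # {'A'}
```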

This defect was not obvious before the Internet was widely used and global communication became frequent, but it is unacceptable today.

To display any character correctly, the computer must first know which encoding was used; otherwise it displays garbled characters.

In addition, for Asian languages, and Chinese in particular, 256 characters are far from enough.

ANSI encoding and MBCS

To represent their own languages, different countries and regions developed their own standards, such as GB2312 (a Simplified Chinese character set) and Big5 (a Traditional Chinese character set).

Because there are far more than 256 Chinese characters, these encodings use two bytes for each Chinese character. On Windows, such locale-specific encodings are generally called ANSI encodings (on Western European systems the ANSI encoding is the single-byte Windows-1252). The double-byte ones are also known as multi-byte character sets (MBCS).

"ANSI" is a Windows term, and what it means depends on the system locale. On Simplified Chinese Windows the ANSI encoding is GB2312 (more precisely its superset GBK, code page 936), while on Japanese Windows it is Shift_JIS (code page 932).

However, different ANSI encodings are mutually incompatible: text written under one cannot be read correctly under another unless the reader knows which encoding was used.
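A small sketch of the incompatibility, using the `gbk`, `big5` and `shift_jis` codecs bundled with Python. The two-character sample text is an assumption chosen because both characters exist in GBK and Big5.

```python
text = "\u4e2d\u6587"  # the two Chinese characters meaning "Chinese (language)"

gbk_bytes = text.encode("gbk")    # Simplified Chinese code page
big5_bytes = text.encode("big5")  # Traditional Chinese code page

print(gbk_bytes.hex(" "))   # d6 d0 ce c4
print(big5_bytes.hex(" "))  # different bytes for the same text

# Nothing in the bytes says which code page produced them. Reading GBK
# bytes as Shift_JIS does not fail loudly, it just yields the wrong text.
# errors="replace" guards against byte sequences that are invalid there.
misread = gbk_bytes.decode("shift_jis", errors="replace")
print(misread)
```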

So we are back to the earlier problem. In today's era of globalization, we need a unified character set that contains all characters.

Unicode

Unicode, as I understand it, is short for "universal code": a code meant to include every character in the universe. For now, of course, that means Earth, not Mars.

Unicode collects the characters in common use around the world and assigns each one a unique number, its code point. For example, U+4E00 is the Chinese character "一" (one). If we save that character as UTF-16 in an editor such as UltraEdit and view the file in hex, the bytes are FF FE 00 4E.

We saved the file as UTF-16, so what is the relationship between UTF-16 and Unicode, and what does the leading "FF FE" mean?

First, we must be clear that Unicode is only a character set. It defines only the numeric value of each symbol; for example, the code point of the Chinese character "一" above is 4E00. It says nothing about how that value is stored.
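In Python the distinction is visible directly: a `str` holds code points, and bytes only appear once an encoding is chosen. A minimal sketch using the built-in `ord` and `chr`:

```python
ch = chr(0x4E00)        # code point -> character
code_point = ord(ch)    # character -> code point
label = f"U+{code_point:04X}"

print(label)      # U+4E00
print(len(ch))    # 1  (one code point, no byte count implied)

# Bytes exist only after an encoding is picked.
stored = ch.encode("utf-16-be")
print(stored.hex(" "))   # 4e 00
```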

Why does Unicode not define the storage structure of characters?

I think the main reason is to save storage and transmission resources. The English letters of ASCII are also included in Unicode. If Unicode fixed the number of bytes per character, that number would have to be greater than one, because there are many characters besides the English letters. With two or more bytes per character, every ASCII character would be stored with one or more all-zero bytes of padding, which wastes storage and makes transmission less efficient.

This is why there are several ways to store Unicode, such as the UTF-16 mentioned above, the UTF-8 we use so often, and UTF-32.
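To see the trade-off in numbers, the sketch below measures how many bytes a few sample characters take in each format. The sample characters and their labels are illustrative choices, and the `-le` codec variants are used so that no BOM is counted.

```python
samples = {
    "A": "\u0041",
    "e-acute": "\u00e9",
    "CJK one": "\u4e00",
    "emoji": "\U0001F600",
}
encodings = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch: str) -> dict:
    """Return the encoded length in bytes of one character per encoding."""
    return {enc: len(ch.encode(enc)) for enc in encodings}

sizes = {label: encoded_sizes(ch) for label, ch in samples.items()}
for label, row in sizes.items():
    print(f"{label:8}", row)
```

The letter A costs 1, 2 and 4 bytes respectively, which is the waste described above, while the CJK character costs 3, 2 and 4.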

Occasionally we also see UCS-2 and UCS-4. What is their relationship with Unicode? UCS stands for Universal Character Set, the character set defined by ISO/IEC 10646, which is kept in step with Unicode. UCS-2 uses two bytes, that is, 16 bits, to represent each character.

Note that, according to Wikipedia, UTF-16 has replaced UCS-2.

UCS-2 can represent the characters from U+0000 to U+FFFF. Within that range, U+D800 to U+DFFF is reserved and is assigned to no character. UTF-16 uses pairs of values from this reserved range (surrogate pairs) to represent the characters from U+010000 to U+10FFFF.

Therefore, although both use 16-bit units, their ranges differ: UCS-2 stops at U+FFFF, while UTF-16 reaches U+10FFFF by spending four bytes on each character above U+FFFF.
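How UTF-16 uses the reserved range can be sketched in a few lines. The function name `to_surrogate_pair` is hypothetical; the arithmetic is the standard UTF-16 surrogate calculation, and Python's own codec is used to cross-check it.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside U+10000..U+10FFFF")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into DC00..DFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")   # D83D DE00

# Cross-check against the built-in big-endian UTF-16 codec.
builtin = chr(0x1F600).encode("utf-16-be")
print(builtin.hex(" "))          # d8 3d de 00
```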

UCS-4 and UTF-32 are, for practical purposes, the same: both store every character in four bytes.

By now it should be clear that UTF-16 is one way of storing Unicode. For characters from U+0000 to U+FFFF it simply stores the code point in two bytes.

Because the two-byte UTF-16 value is identical to the Unicode value for these characters, people often treat Unicode and UTF-16 as the same thing. In fact one is the definition and the other is a storage format.

With two bytes per unit, which byte is stored first? Either order is acceptable, which brings us to the issue of byte order.

Byte order

The byte order mark, or BOM, is the FF FE at the start of the file mentioned earlier. It is the method the Unicode specification recommends for marking byte order.

The BOM is the Unicode character U+FEFF, called "zero width no-break space". Writing it before the text indicates the byte order of what follows, that is, whether the high byte or the low byte of each unit is stored first.

Little endian: the file starts with FF FE, and the low byte of each unit is stored first, followed by the high byte. For example, the Chinese character "一" has the Unicode value U+4E00; saved as UTF-16 (little endian), the file contains FF FE 00 4E.

Big endian: the file starts with FE FF, and the high byte of each unit is stored first, followed by the low byte. The same character is saved as FE FF 4E 00.
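Both layouts can be reproduced with Python's standard `codecs` module. This is a sketch for the single character U+4E00; the variable names are illustrative.

```python
import codecs

ch = "\u4e00"

# Build each file image by hand: BOM first, then the encoded character.
little = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + ch.encode("utf-16-be")

print(little.hex(" "))   # ff fe 00 4e
print(big.hex(" "))      # fe ff 4e 00

# The generic "utf-16" codec reads the BOM, picks the byte order,
# and strips the BOM from the decoded text.
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")
print(decoded_little == decoded_big == ch)   # True
```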

That is, in general, how UTF-16 stores text. Its unit is fixed at two bytes, whereas UTF-8 is a variable-length format, and UTF-8 has gradually become the most popular way to store and transmit characters on the Internet.

UTF-8

UTF-8 uses 1 to 4 bytes to represent a symbol. The encoding rules are as follows:

For a single-byte symbol, the first bit is 0 and the remaining seven bits hold its Unicode value, which is the same as its ASCII code.

For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0. Each following byte starts with the two bits 10. The remaining bits are filled, in order, with the binary digits of the symbol's Unicode value.

The specific form is as follows:

Unicode symbol range (hexadecimal) | UTF-8 encoding (binary)
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
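The rules can be turned into code almost line by line. The function below is a hand-written sketch, not how any particular library implements it; the name `utf8_encode` is hypothetical, and for brevity it does not reject the reserved range U+D800 to U+DFFF as a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling its bits into the UTF-8 patterns."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point is outside U+0000..U+10FFFF")
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

print(utf8_encode(0x41).hex(" "))     # 41
print(utf8_encode(0x4E00).hex(" "))   # e4 b8 80
```

Comparing the result with the built-in `str.encode("utf-8")` for a few code points is an easy way to check the sketch.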

Using this table, we can convert the Chinese character "一" to UTF-8.

Its Unicode value is 4E00, which falls in the range 0000 0800-0000 FFFF, so it is encoded in three bytes. In binary the value is 0100 1110 0000 0000. Filling these 16 bits into the three-byte pattern gives 1110 0100 1011 1000 1000 0000, which is E4 B8 80 in hexadecimal. Now save the same "一" character in UTF-8 format; the file contains EF BB BF E4 B8 80.

What is the EF BB BF at the start of the result?

UTF-8 encodes Unicode in 8-bit units, that is, byte by byte, so it has no byte order problem. EF BB BF is simply the BOM character U+FEFF encoded in UTF-8.

Since there is no byte order problem, why add a BOM at all?

Because the BOM can also indicate the encoding itself. When we receive data that starts with EF BB BF, we can assume that what follows is encoded in UTF-8.
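A receiver can act on this with a simple prefix check. The helper below is a minimal sketch: `sniff_bom` is a hypothetical name, it only recognises the UTF-8 and UTF-16 marks discussed in this article, and it returns Python codec names (`utf-8-sig` is the built-in codec that strips the UTF-8 BOM while decoding).

```python
import codecs
from typing import Optional

def sniff_bom(data: bytes) -> Optional[str]:
    """Guess a codec name from a leading BOM, or return None if absent."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None  # no BOM: the encoding must be learned some other way

# The UTF-8 BOM is just U+FEFF run through the UTF-8 rules.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes.hex(" "))    # ef bb bf

saved = codecs.BOM_UTF8 + "\u4e00".encode("utf-8")
print(saved.hex(" "))        # ef bb bf e4 b8 80

encoding = sniff_bom(saved)
text = saved.decode(encoding)
print(encoding, text == "\u4e00")   # utf-8-sig True
```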

Okay, it's over!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
