The Mystery of Character Encoding

Character encoding

Do you think "ASCII = one character is 8 bits"? Do you think one byte is one character, and one character is 8 bits? Do you still think UTF-8 uses 8 bits to represent a character? If you really think so, read this article carefully!

Why do I have to encode?

First of all, we need to be clear that all data in a computer is stored and processed as bytes. We use these bytes to represent information, but the bytes themselves are meaningless; we have to give them practical meaning. That is why encoding standards are set.

Encoding models

The first thing to be clear about is that there are two kinds of encoding models.

Simple Character Set

In this encoding model, a character set defines which characters it contains and, at the same time, how each character corresponds to a bit pattern in the computer. ASCII, for example, directly defines A -> 0100 0001.
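
You can check this mapping with a couple of lines of Python (a minimal sketch; any Python 3 interpreter will do):

    # ASCII maps 'A' directly to the 7-bit value 100 0001 (65 in decimal)
    print(ord('A'))                  # 65
    print(format(ord('A'), '08b'))   # 01000001
    print(chr(0b01000001))           # A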

Modern encoding model

In the modern encoding model, finding out how a character maps to bits in a computer takes several steps:

    1. Determine which characters the system needs to support; the collection of these characters is called the character repertoire.
    2. Assign a number to each abstract character in the repertoire; this mapping from the character set to a set of integers is called the coded character set (CCS). Unicode is a concept that belongs to this layer: it has nothing to do with what is inside the computer and is a purely mathematical abstraction.
    3. Convert the integer that the CCS assigns to a character into a finite-length bit value, so that the computer can later represent it in binary of a certain length. This correspondence is called the character encoding form (CEF); UTF-8 and UTF-16 both belong to this layer.
    4. Store and transmit the bit values produced by the CEF. Because of byte-order (endianness) issues, this step depends on the specific platform. This layer is called the character encoding scheme (CES).
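
To make the four layers concrete, here is a minimal Python sketch that walks one character through them (the character 严 is used only as an example):

    ch = '严'                          # 1. a character from the repertoire
    code_point = ord(ch)               # 2. CCS: character -> abstract integer
    utf8 = ch.encode('utf-8')          # 3. CEF: code point -> byte sequence
    utf16_le = ch.encode('utf-16-le')  # 4. CES territory: the same UTF-16
    utf16_be = ch.encode('utf-16-be')  #    code unit, serialized in two byte orders
    print(hex(code_point), utf8, utf16_le, utf16_be)
    # 0x4e25 b'\xe4\xb8\xa5' b'%N' b'N%'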

The encoding we usually talk about happens in the third step and does not involve the CES, so the CES is outside the scope of this discussion.
Now, someone might wonder: why do we need a modern encoding model at all? Why split it into so many concepts, instead of defining everything at once the way the simple model does? These questions are answered by the history of encoding.

History of encoding

ASCII

ASCII appeared in the United States in the 1960s. It defines 128 characters in total, using 7 bits of a byte. The defined characters include the English letters A-Z and a-z, the digits 0-9, and some punctuation marks and control characters. Run man ascii in a shell to see the full ASCII character set. The encoding model ASCII uses is the simple character set: it directly defines each character's bit-value representation, such as the A -> 0100 0001 mentioned above. In other words, ASCII completes the first three steps of the modern encoding model in one go.
The ASCII standard was perfect for English-speaking countries. But don't forget that there are thousands of languages in the world, and they contain far more than these symbols. If the people who use those languages also want to use computers, ASCII is far from enough. Encoding entered its chaotic era.

Chaotic times

People knew that a byte on a computer has 8 bits and can represent 256 values, and ASCII used only 7 of those bits, so people decided to use the rest. That is where the problem arose: everyone agreed on the 128 characters that were already defined, but different language communities needed different additional characters, so the remaining 128 slots were extended in mutually incompatible ways. Even more confusing, Asian writing systems have far more characters than one byte can hold; there are more than 100,000 Chinese characters, for example, so how could the 256 values of one byte possibly be enough? The result was a variety of schemes that represent one character with multiple bytes (GBK is one of them), which made the whole situation even more chaotic. (By this point you should no longer think that a byte is a character and a character is 8 bits.) Each language family had its own specific code pages, which became a problem as soon as different languages appeared on the same computer, or people using different languages communicated over the Internet. At this point, Unicode appeared.
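
The chaos is easy to reproduce today (a small Python sketch; gbk and latin-1 are just two of the many code pages Python still ships):

    data = '严'.encode('gbk')      # GBK stores 严 as two bytes
    print(data.decode('gbk'))      # 严  -- decoded with the matching code page
    print(data.decode('latin-1'))  # two unrelated Western characters: mojibake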

Unicode

Unicode assigns a code to every character used in computing. What does Unicode do, in plain terms? It is like giving every person one universal ID card, instead of each country issuing its own. Unicode is a coded character set (CCS): what it does is map each character we need to represent to a number, and that number is called the character's code point. For example, the code point of the character 严 ("strict") in Unicode is U+4E25.
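
In Python, ord and chr operate exactly at this CCS layer; no bytes are involved yet:

    print(hex(ord('严')))  # 0x4e25 -- the code point is a pure number
    print(chr(0x4E25))     # 严
    print('\u4e25')        # 严 -- the same character, written by code point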

So far, we have only established a mapping between a bunch of characters and numbers; we are still at the CCS layer. How these numbers are stored in computers and transmitted over networks has not been touched yet.

Character encoding

Everything above is still about character sets; now we finally reach the CEF layer. To make storage and processing convenient for the computer, we have to turn these purely mathematical numbers into finite-length bit values. The most intuitive design, of course, is to take whatever number a character's code point is and convert it to the corresponding binary representation. For example, 严 corresponds to the number 0x4E25 in Unicode, whose binary form is 100 1110 0010 0101, so 严 needs two bytes of storage. In this way the majority of Chinese characters can be expressed in two bytes. But there are other languages whose characters would need 4 bytes under this conversion. And that is the problem: how many bytes should be used to represent one character? If you fix it at two bytes, some characters cannot be encoded at all; if you use more bytes per character, many people will object, because two bytes are enough for the characters of their language, so why waste space on more?
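
The waste is easy to measure (a sketch in which UTF-32 stands in for the naive fixed-width design):

    for s in ('A', '严'):
        print(s,
              len(s.encode('utf-32-be')),  # fixed width: always 4 bytes
              len(s.encode('utf-8')))      # variable width: 1 byte for A, 3 for 严
    # A 4 1
    # 严 4 3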

Then why not use a variable number of bytes to store a character? If you do, the computer has to be able to tell how many bytes the current character occupies; it cannot simply guess. The following describes the design of UTF-8, the most commonly used encoding form (UTF is the abbreviation of Unicode Transformation Format). The table below is from Nanyi's blog:

    Unicode code point range (hex)  |  UTF-8 encoding (binary)
    0000 0000 - 0000 007F           |  0xxxxxxx
    0000 0080 - 0000 07FF           |  110xxxxx 10xxxxxx
    0000 0800 - 0000 FFFF           |  1110xxxx 10xxxxxx 10xxxxxx
    0001 0000 - 0010 FFFF           |  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

(x indicates the available bits)

Through this UTF-8 correspondence, the code point of every Unicode character can be converted into the binary representation the computer actually uses. Notice that the UTF-8 conversion is fully compatible with the original ASCII, and that when multiple bytes represent one character, the number of leading 1 bits in the first byte tells you how many bytes the character occupies after the UTF-8 conversion. The following example is also from Nanyi's blog:

The Unicode code point of 严 ("strict") is 4E25 (binary 100 1110 0010 0101). According to the table above, 4E25 falls within the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes,
that is, the format is "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of 严, fill the x positions from right to left, and pad the remaining positions with 0. This gives the UTF-8 encoding of 严: 11100100 10111000 10100101, which in hexadecimal is 0xE4B8A5.
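
This whole calculation can be verified in a few lines of Python:

    encoded = '严'.encode('utf-8')
    print(encoded.hex())  # e4b8a5
    print(' '.join(format(b, '08b') for b in encoded))
    # 11100100 10111000 10100101 -- matches the hand calculation above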

In addition to the UTF-8 conversion method, there are others such as UTF-16 and UTF-32, which are not introduced further here. (Note that the number after UTF indicates the size of the code unit. A code unit is the shortest bit combination that forms a unit of encoded text. For UTF-8 the code unit is 8 bits; for UTF-16 it is 16 bits. In other words, UTF-8 works in minimum units of one byte, and UTF-16 in minimum units of two bytes.)
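
Code unit sizes show up directly in encoded lengths (a sketch; the -be spellings avoid the byte-order mark, and U+1F600 is included only as an example of a character outside the 16-bit range):

    for s in ('A', '严', '\U0001F600'):
        print(s,
              len(s.encode('utf-8')),      # 1, 3, 4 bytes -> 1, 3, 4 code units
              len(s.encode('utf-16-be')),  # 2, 2, 4 bytes -> 1, 1, 2 code units
              len(s.encode('utf-32-be')))  # 4, 4, 4 bytes -> always 1 code unit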
