Understanding character encodings

Source: Internet
Author: User

I. Preface

While solving yesterday's problem I ran into many new questions: why do we need encodings at all, and how are encodings such as ASCII, ISO-8859-1, GB2312, GBK and Unicode related? I wanted to thoroughly understand the story behind character encoding, so I explored the topic and wrote up the notes below. I hope that reading this article clears up many of your doubts as well.

II. Character encoding

2.1 Why do we need encoding?

All information in a computer is ultimately represented as a binary string, where each bit has two states, 0 and 1. When we want to store the character 'A', which bit pattern should it correspond to? We could, for example, store 'A' as the binary string 01000010 (a value made up for illustration) and, when reading it back, restore 01000010 to the character 'A'. So the question is: which binary string should 'A' correspond to when it is stored? 01000010? Or 10000000, or 11110101? Put bluntly, a rule is needed. A rule that maps each character to a unique state (binary string) is an encoding. The earliest such rule is ASCII, in which 'A' corresponds neither to 01000010 nor to 10000000 or 11110101, but to 01000001 (do not ask why, that is simply the rule).
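As a small illustration of "a rule that maps characters to bit strings", the sketch below uses Python's built-in `ord`, `chr` and `format`. The helper names `char_to_bits` and `bits_to_char` are made up for this example and are not part of any library.

```python
def char_to_bits(ch: str) -> str:
    """Return the 8-bit binary string that ASCII assigns to one character."""
    code = ord(ch)  # the number the encoding rule assigns to the character
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")  # zero-padded to one byte


def bits_to_char(bits: str) -> str:
    """Restore the character from its stored binary string."""
    return chr(int(bits, 2))


stored = char_to_bits("A")
print(stored)                # the bit pattern written to storage
print(bits_to_char(stored))  # the character read back
```

Both sides must agree on the same rule: the writer and the reader use the same table, which is exactly what an encoding standard provides.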

2.2 ASCII

This set of encoding rules was defined in the United States and specifies 128 characters. For example, the space is 32 (decimal, binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including the non-printable control characters, codes 0-31 and 127) occupy only the low 7 bits of a byte (8 bits); the highest bit is always 0. With only 128 characters encoded, one byte is not fully used, and 128 characters is rather few. So people began to use the highest bit as well: encodings in which the highest bit may be 1 go beyond ASCII and are called non-ASCII encodings, such as ISO-8859-1.
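The claim that every ASCII byte has its highest bit set to 0 can be checked with a few lines of Python. This is only a sketch using the standard `ascii` codec; the sample string is arbitrary.

```python
ascii_bytes = "Hello, ASCII!".encode("ascii")

# Shift each byte right by 7 to isolate the highest bit.
high_bits = [b >> 7 for b in ascii_bytes]
print(high_bits)  # every entry is 0 for pure ASCII text

print(ord(" "), format(ord(" "), "08b"))  # 32 00100000
print(ord("A"), format(ord("A"), "08b"))  # 65 01000001

# A character outside the 128 ASCII codes cannot be encoded as ASCII.
try:
    "§".encode("ascii")
    non_ascii_rejected = False
except UnicodeEncodeError:
    non_ascii_rejected = True
print(non_ascii_rejected)
```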

2.3 iso-8859-1

This set of encoding rules was defined by ISO. It is one of several standards developed to extend ASCII: 00000000 (0) to 01111111 (127) are the same as ASCII, and the range 10000000 (128) to 11111111 (255) is used for additional characters; for example, the character § is encoded as 10100111 (167). ISO-8859-1 is also a single-byte encoding and can represent at most 256 characters. Latin1 is an alias of ISO-8859-1, sometimes written Latin-1.

Even 256 characters are far too few for Chinese: one byte is not enough, and more than one byte per character is required. However, because ISO-8859-1 is a single-byte encoding that matches the computer's most basic unit of representation, data is still often carried as ISO-8859-1, and many protocols use it by default. For example, the two characters "中文" do not exist in ISO-8859-1. In GB2312 they are the two characters D6D0 and CEC4; when handled as ISO-8859-1 they are split into 4 bytes, D6 D0 CE C4 (in fact, storage is byte-oriented anyway). In UTF-8 they are the 6 bytes E4 B8 AD E6 96 87. Obviously, such bytes must be decoded again with the original encoding to display correctly. The common Chinese encodings are GB2312, BIG5 and GBK.
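The behaviour described above, carrying multi-byte text through ISO-8859-1 and decoding it again with the real encoding, can be sketched with Python's standard codecs. The helper `hex_bytes` is a hypothetical name introduced only for printing.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


text = "中文"

gb_bytes = text.encode("gb2312")
print(hex_bytes(gb_bytes))    # D6 D0 CE C4

utf8_bytes = text.encode("utf-8")
print(hex_bytes(utf8_bytes))  # E4 B8 AD E6 96 87

# ISO-8859-1 maps every byte value to one character, so the four GB2312
# bytes become four (unrelated) Latin-1 characters.
as_latin1 = gb_bytes.decode("iso-8859-1")
print(len(as_latin1))

# Going back through ISO-8859-1 recovers the original bytes, which can
# then be decoded with the encoding that was really used.
restored = as_latin1.encode("iso-8859-1").decode("gb2312")
print(restored)

section_sign = "§".encode("iso-8859-1")
print(hex_bytes(section_sign), section_sign[0])  # A7 167
```

The round trip works only because ISO-8859-1 assigns a character to all 256 byte values; the intermediate string is meaningless until it is decoded with the right encoding.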

2.4 GB2312

GB2312 arranges the characters it contains into zones (a "partition" scheme). There are 94 zones, numbered 1 to 94 (decimal), and each zone contains 94 positions, also numbered 1 to 94 (decimal), for a total of 8,836 (94 * 94) code positions. This representation is known as the location code. GB2312 is a double-byte encoding in which the high byte represents the zone and the low byte represents the position. The zones are assigned as follows:

Zones 01-09 contain 682 characters other than Chinese characters, with 164 vacant positions (9 * 94 - 682).

Zones 10-15 are empty and not used.

Zones 16-55 contain 3,755 level-1 Chinese characters (simplified), sorted by pinyin.

Zones 56-87 contain 3,008 level-2 Chinese characters (simplified), sorted by radical and stroke count.

Zones 88-94 are empty and not used.

So how is the GB2312 code calculated from the location code? Location codes range from 0101 to 9494 (including the empty positions). The conversion follows the rules below.

1. Convert the zone (decimal) to hexadecimal.

2. Add hexadecimal A0 to the result to get the high byte of the GB2312 encoding.

3. Convert the position (decimal) to hexadecimal.

4. Add hexadecimal A0 to the result to get the low byte of the GB2312 encoding.

5. Combine the two: zone in the high byte, position in the low byte.

6. The result is the GB2312 encoding.

(Figure: flowchart of the conversion from location code to GB2312 code.)

For example, the location code of the character '李' (Li) is 3278 (zone 32, position 78). 1. Convert 32 (zone) to hexadecimal: 20. 2. Add A0 to get C0. 3. Convert 78 (position) to hexadecimal: 4E. 4. Add A0 to get EE. 5. Combine zone and position to get C0EE. 6. The GB2312 encoding of '李' is therefore C0EE.
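The six conversion steps can be sketched as a short Python function. The name `location_to_gb2312` is hypothetical; the result is cross-checked against Python's built-in `gb2312` codec.

```python
def location_to_gb2312(zone: int, position: int) -> bytes:
    """Convert a GB2312 location code (zone, position) to its two bytes."""
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError("zone and position must both be in 1..94")
    high = zone + 0xA0      # steps 1-2: zone plus A0 gives the high byte
    low = position + 0xA0   # steps 3-4: position plus A0 gives the low byte
    return bytes([high, low])  # step 5: zone first, position second


encoded = location_to_gb2312(32, 78)  # location code 3278
print(encoded.hex().upper())          # C0EE
print(encoded.decode("gb2312"))       # 李
```

Converting to hexadecimal first, as the steps describe, is only a matter of notation; adding 0xA0 to the decimal value gives the same byte.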

GB2312 encodes each character in two bytes using the zone scheme and covers 6,763 (3,755 + 3,008) Chinese characters. These are the most commonly used characters and cover 99.75% of usage frequency in mainland China. However, some Chinese characters, such as 'Rong', are not encoded in GB2312, which causes problems; this led to the now-mainstream GBK encoding. Before explaining GBK, we first look at BIG5.

2.5 BIG5

BIG5 is a double-byte encoding that uses two bytes to represent one character. The high byte uses 0x81-0xFE, and the low byte uses 0x40-0x7E and 0xA1-0xFE. It is an encoding standard for the traditional Chinese character set and contains 13,060 characters, two of which are encoded twice: "兀" (A461 and C94A) and "嗀" (DCD1 and DDFC). The ranges are assigned as follows:

8140-A0FE: reserved for user-defined characters.

A140-A3BF: punctuation marks, Greek letters and special symbols. Within this range, A259-A261 hold the characters for units of weights and measures: 兙 兛 兞 兝 兡 兣 嗧 瓩 糎.

A3C0-A3FE: reserved. This range is not available as a user-defined area.

A440-C67E: commonly used Chinese characters, sorted first by stroke count and then by radical.

C6A1-F9DC: other Chinese characters.

F9DD-F9FE: box-drawing (table) symbols.

Note that the BIG5 encoding is unrelated to the GBK encoding.

2.6 GBK

GBK extends GB2312 and is fully compatible with it (for example, '李' is C0EE in both GBK and GB2312), but it is not compatible with BIG5 (the character 'long' is AAF8 in BIG5 but E94C in GBK, and '李' is A7F5 in BIG5, not C0EE). In other words, text encoded with GB2312 decodes correctly as GBK, but text encoded with BIG5 and decoded as GBK comes out garbled. Compared with GB2312, GBK encodes more Chinese characters, such as 'Rong'. GBK is still a double-byte encoding. Its range is 8140-FEFE, excluding the xx7F code positions, for a total of 23,940 code positions that represent 21,003 Chinese characters.

GB18030 appeared after GBK but has not become mainstream, so it is not covered here. That completes the explanation of the Chinese encodings.

Then another question arises. If Internet users in mainland China and in Taiwan all use GBK, there is no problem; but if one side uses GBK and the other uses BIG5, text becomes garbled. And that is only communication across the Strait; across the oceans, garbling is even more likely. At this point we might wish for one universal encoding for the whole world. Such an encoding does exist: Unicode.
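The compatibility claims about GB2312, GBK and BIG5 can be tried out with Python's standard `gb2312`, `gbk` and `big5` codecs. This is a minimal sketch; `errors="replace"` is used as a precaution because the BIG5 bytes may not form a valid GBK sequence, in which case Python would otherwise raise an exception.

```python
ch = "李"

gb2312_bytes = ch.encode("gb2312")
gbk_bytes = ch.encode("gbk")
big5_bytes = ch.encode("big5")

print(gb2312_bytes.hex().upper(), gbk_bytes.hex().upper())  # same bytes
print(big5_bytes.hex().upper())                             # different bytes

# GB2312 output is valid GBK input: decoding works unchanged.
same_in_gbk = gb2312_bytes.decode("gbk")
print(same_in_gbk)

# BIG5 output is not meaningful as GBK: the result is garbled text.
garbled = big5_bytes.decode("gbk", errors="replace")
print(repr(garbled))
```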

2.7 Unicode

There were originally two separate attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode project, organized by a consortium of multilingual software manufacturers. Around 1991, participants in both projects realized that the world did not need two different unified character sets. They merged their work and cooperated on a single code table. Both projects still exist and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep their code tables compatible and to coordinate closely on any future extensions.

Unicode is a table that contains every character that may appear. Each character corresponds to a number called a code point; for example, the code point of 'H' is 72 (decimal) and the code point of '李' is 26446 (decimal). The Unicode table contains 1,114,112 code points, from 000000 to 10FFFF (hexadecimal). Every character in the world is meant to have a unique code point in the Unicode table.

Unicode divides the code space into 17 planes, numbered 00 to 10 in hexadecimal (at most two digits), that is, 0 to 16 in decimal. Each plane has 65,536 (2^16) code points. The most important is the first plane (code points 0000-FFFF), which contains the most commonly used characters and is called the Basic Multilingual Plane (BMP). The other planes are called supplementary planes. Within the BMP, the range D800-DFFF is permanently reserved and is not mapped to characters; UTF-16 cleverly uses these reserved code points to encode characters in the supplementary planes, as explained later.

Unicode is only a character set: it specifies which code point corresponds to each character, but not how code points are stored. Storage is defined by separate encoding schemes, which follow two main lines: UCS and UTF. The UTF line is maintained by the Unicode Consortium, and the UCS line is maintained by ISO/IEC.
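The numbers in this section (code points, 17 planes, the reserved D800-DFFF range) can be checked with a short Python sketch. `plane_of` and `is_surrogate` are hypothetical helper names; `ord` and `sys.maxunicode` are standard.

```python
import sys

PLANE_SIZE = 0x10000       # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF  # last code point of plane 16


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that a code point belongs to."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16


def is_surrogate(code_point: int) -> bool:
    """True for the reserved BMP range D800-DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


print(ord("H"), ord("李"))                        # 72 26446
print(hex(ord("李")), plane_of(ord("李")))        # 0x674e 0
print(plane_of(0x1016E))                          # 1, a supplementary plane
print(17 * PLANE_SIZE, MAX_CODE_POINT + 1)        # both 1114112
print(sys.maxunicode == MAX_CODE_POINT)
```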

2.8 UCS

UCS stands for "Universal Character Set". Its main forms are UCS-2 and UCS-4.

1. UCS-2

UCS-2 is a fixed-length encoding that uses 2 bytes per character. Its code range, 0000-FFFF (hexadecimal), corresponds to the first Unicode plane. It uses the BOM (Byte Order Mark) mechanism, which serves two purposes: 1. It determines whether the byte stream is big-endian or little-endian. 2. It identifies the Unicode encoding scheme of the byte stream.

2. UCS-4

UCS-4 is a fixed-length encoding that uses 4 bytes per character. It also uses the BOM mechanism.

2.9 UTF

UTF stands for "Unicode Transformation Format". Its main forms are UTF-8, UTF-16 and UTF-32.

1. UTF-8

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. UTF-8 is fully compatible with ASCII: for characters in ASCII, the UTF-8 encoded value is exactly the same as the ASCII value. UTF-8 is one concrete encoding implementation of Unicode. It is the most widely used Unicode encoding on the Internet because, being variable-length rather than fixed-length, it helps save network traffic. The rules for converting a Unicode code point to UTF-8 are as follows:

① For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. So for English letters, the UTF-8 encoding is the same as the ASCII code.

② For a symbol that needs n bytes (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits are filled with the symbol's Unicode code point.

The coding rules are summarized in the following table:

  Unicode symbol range                      |  UTF-8 encoding
  (hexadecimal)         (decimal)           |  (binary)
  ------------------------------------------+--------------------------------------
  0000 0000-0000 007F   (0-127)             |  0xxxxxxx
  0000 0080-0000 07FF   (128-2047)          |  110xxxxx 10xxxxxx
  0000 0800-0000 FFFF   (2048-65535)        |  1110xxxx 10xxxxxx 10xxxxxx
  0001 0000-0010 FFFF   (65536-1114111)     |  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: the Unicode code point of the character 'A' is 65 (decimal). According to the table it falls in the first row, so the UTF-8 encoding of 'A' is 01000001. The Unicode code point of the Chinese character '李' is 26446 (decimal), which is 01100111 01001110 in binary and 674E in hexadecimal. According to the table it falls in the third row. Fill the bits of '李' into the x positions starting from the last x and working toward the first, padding any unused x positions with 0. The resulting UTF-8 encoding is 11100110 10011101 10001110, that is, E69D8E (hexadecimal).

In the rules above, 0000 0000-0000 FFFF (rows one to three) is the first Unicode plane (the Basic Multilingual Plane), while 0001 0000-0010 FFFF (row four) covers the other planes (the supplementary planes). The BMP holds the most commonly used characters. Code points greater than 65535 (decimal), that is, code points in the supplementary planes, require 4 bytes in UTF-8.
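The four rows of the UTF-8 rule table translate almost directly into code. The following is a minimal sketch, not a production encoder: the name `utf8_encode` is hypothetical, and unlike Python's own codec it does not reject the reserved range D800-DFFF.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the rule table."""
    if cp < 0x80:        # row 1: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # row 2: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # row 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # row 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point outside the Unicode range")


encoded = utf8_encode(ord("李"))
print(encoded.hex().upper())                   # E69D8E
print(" ".join(f"{b:08b}" for b in encoded))   # 11100110 10011101 10001110
print(encoded == "李".encode("utf-8"))          # agrees with the built-in codec
```

Each `& 0x3F` keeps six bits of the code point, matching the six x positions in every continuation byte.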

2. UTF-16

UTF-8 is a variable-length encoding that uses 1, 2, 3 or 4 bytes, whereas UTF-16 uses only 2 or 4 bytes. UTF-16 is also a concrete encoding implementation of Unicode. The rules for converting a Unicode code point to UTF-16 are as follows:

① If the Unicode code point is in the first plane (BMP), it is encoded with 2 bytes.

② If the Unicode code point is in another plane (a supplementary plane), it is encoded with 4 bytes.

The encoding of supplementary-plane code points works as follows. Such a code point is encoded as a pair of 16-bit code units (4 bytes in total) called a surrogate pair. The first unit is called the high surrogate or lead surrogate, with the range D800-DBFF. The second is called the low surrogate or trail surrogate, with the range DC00-DFFF. Note that the high surrogates (D800-DBFF) and the low surrogates (DC00-DFFF) together cover exactly D800-DFFF. This range is reserved in the first plane and does not map to any character, so UTF-16 cleverly uses it to encode supplementary-plane code points in 4 bytes.

Example: the Unicode code point of the character 'A' is 65 (decimal), 41 in hexadecimal, which is in the first plane. According to the rules, UTF-16 encodes it with 2 bytes. But here another question arises: we know that two bytes are used and that the computer stores data byte by byte, so should those two bytes be written as 00 41 (hexadecimal) or as 41 00? This is the problem that the BOM mechanism mentioned earlier solves.

Writing 00 41 means big-endian byte order, whereas 41 00 means little-endian. So how does the computer know whether the stored character data is big-endian or little-endian? Some control information must be added: for big-endian, FE FF is written at the start of the file; for little-endian, FF FE is written at the start. When the computer begins reading and finds that the first two bytes are FE FF, it knows that the data that follows is big-endian; if they are FF FE, it is little-endian.
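The BOM check described above can be sketched in Python with the constants from the standard `codecs` module. `detect_utf16_byte_order` is a hypothetical helper written for this illustration.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Inspect the first two bytes of UTF-16 data for a byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return "unknown (no BOM)"


big = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")     # FE FF 00 41
little = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")  # FF FE 41 00

print(big.hex(), detect_utf16_byte_order(big))
print(little.hex(), detect_utf16_byte_order(little))

# The generic "utf-16" codec reads the BOM and picks the byte order itself.
print(big.decode("utf-16"), little.decode("utf-16"))
```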

Now take the character U+1016E (it may not display in every font). Its Unicode code point is 65902 (decimal), 1016E in hexadecimal, which is clearly beyond the range that the first plane (BMP) can represent. It lies in a supplementary plane, so according to the rules UTF-16 encodes it with 4 bytes. However, the encoding is not obtained by simply padding the code point to 4 bytes (00 01 01 6E); it is calculated using the following rules.

① Subtract 10000 (hexadecimal) from the Unicode code point and expand the result to 20 bits. (The maximum Unicode code point is 10FFFF (hexadecimal); subtracting 10000 (hexadecimal) gives a maximum of FFFFF (hexadecimal), which is 20 bits. Values shorter than 20 bits are padded with leading zeros.)

② Split the 20 bits from step ① into the high 10 bits and the low 10 bits.

③ Expand the high 10 bits from step ② to 2 bytes and add D800 (hexadecimal) to get the high (lead) surrogate. Its range is D800-DBFF.

④ Expand the low 10 bits from step ② to 2 bytes and add DC00 (hexadecimal) to get the low (trail) surrogate. Its range is DC00-DFFF.

(Figure: flowchart of the Unicode to UTF-16 conversion rules.)

Following these rules, we calculate the UTF-16 encoding of the character above. Its code point is 1016E. Subtracting 10000 gives 016E, which is expanded to 20 bits as 0016E. Splitting it, the high 10 bits are 00 0000 0000, hexadecimal 0000, and adding D800 gives D800; the low 10 bits are 01 0110 1110, hexadecimal 016E, and adding DC00 gives DD6E. Combining the two gives D800 DD6E. That is, the UTF-16 encoding is D800 DD6E (written byte by byte: D8 00 DD 6E).

UTF-32 uses 4 bytes for every character and also uses the BOM mechanism. It is analogous to UTF-16, so it is not described further here.
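The four surrogate-pair steps can be sketched as follows. `to_surrogate_pair` is a hypothetical helper name; the result is compared with Python's built-in `utf-16-be` codec, which writes the high byte of each 16-bit unit first and adds no BOM.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need a pair")
    offset = cp - 0x10000              # step 1: at most 20 bits remain
    high = 0xD800 + (offset >> 10)     # steps 2-3: high 10 bits plus D800
    low = 0xDC00 + (offset & 0x3FF)    # steps 2, 4: low 10 bits plus DC00
    return high, low


high, low = to_surrogate_pair(0x1016E)
print(f"{high:04X} {low:04X}")  # D800 DD6E

utf16_be = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(utf16_be.hex(" ") if hasattr(bytes, "hex") else utf16_be)
print(utf16_be == chr(0x1016E).encode("utf-16-be"))
```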

IV. Differences between character encodings

4.1 The difference between UCS-2 and UTF-16

From the analysis above, UCS-2 uses two bytes per character. Within the code range 0000 to FFFF it is basically the same as UTF-16. The difference is that in UTF-16 the code points U+D800 to U+DFFF do not correspond to any character and are used only as surrogates, whereas UCS-2, which has no surrogate mechanism, does not give these values that special role.

UCS-2 can only represent code points within the BMP (it has only 2 bytes), while UTF-16 can also represent code points in the supplementary planes (using 4 bytes).

In the abstract, UTF-16 can be seen as a superset of UCS-2. When no supplementary-plane characters (surrogate code points) are involved, UTF-16 and UCS-2 mean basically the same thing. Once supplementary-plane characters are needed, only UTF-16 can represent them.

4.2 The difference between UCS-4 and UTF-16

In the BMP, UTF-16 uses 2 bytes per character, while in the supplementary planes it uses 4 bytes. UCS-4 uses 4 bytes on every plane.
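The size difference can be measured directly. Python has no codec named UCS-4, so this sketch uses `utf-32-be` as a stand-in, on the assumption that it behaves the same way for this purpose (4 bytes per code point); the `-be` variants are chosen so that no BOM is added to the byte counts.

```python
bmp_char = "李"                   # U+674E, in the BMP
supplementary_char = chr(0x1016E)  # U+1016E, in a supplementary plane

sizes = {
    codec: (len(bmp_char.encode(codec)),
            len(supplementary_char.encode(codec)))
    for codec in ("utf-16-be", "utf-32-be")
}

for codec, (bmp_len, supp_len) in sizes.items():
    print(f"{codec}: BMP {bmp_len} bytes, supplementary {supp_len} bytes")
```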

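4.3 Why UTF-8 does not need the BOM mechanism

Because UTF-8 carries its own control information. For example, in 1110xxxx 10xxxxxx 10xxxxxx the leading 1110 acts as control information: it says that the character occupies three bytes. UTF-8 is also read one byte at a time, so there is no multi-byte unit whose byte order could be ambiguous. No additional BOM mechanism is therefore needed.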
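To illustrate how the leading bits act as control information, the sketch below reads only the first byte of each sequence to decide how many bytes belong to the character. The helper names `utf8_sequence_length` and `split_utf8` are hypothetical, and the code assumes well-formed input.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if lead_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("continuation byte or invalid lead byte")


def split_utf8(data: bytes) -> list:
    """Cut a well-formed UTF-8 byte string into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


data = ("A李" + chr(0x1016E)).encode("utf-8")
chunks = split_utf8(data)
print([c.hex() for c in chunks])
print([len(c) for c in chunks])
```

Because every decision is made from single bytes in stream order, the same byte sequence is read identically on big-endian and little-endian machines.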

V. Summary

If you had the patience to read this far, I believe most of your doubts about character encoding have been resolved. This concludes my exploration of the mainstream encodings. The process was not easy, but I am glad to have finally sorted it out and to share the experience with fellow readers. If you have any questions, feel free to get in touch. Thank you for reading.
