Unicode encoding and Its Implementation: UTF-16, UTF-8, and more


Tian haili @ csdn

2012-04-25

This article discusses the Unicode encoding and its various implementations, focusing on the encoding rules of UTF-16 and UTF-8 and on big-endian and little-endian storage order.

I. Unicode encoding

Before Unicode appeared, there were many encoding standards: ANSI, ISO8859-1, GB2312, GBK, Big5, and so on. Unicode tries to unify them. Unicode itself was revised as it evolved: at first it represented characters in 16 bits, which allows 65,536 code values and was thought to be enough for every character; later, as large numbers of Chinese, Korean, and Japanese ideographs were added, the number of characters exceeded 65,536, and 16 bits could no longer describe the whole character set.

The value assigned to a character in the Unicode character set is called its code point. It is written in hexadecimal with the U+ prefix. For example, the code point of '田' ("Tian") is U+7530; the code point of 'A' is U+0041.
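The U+ notation can be tried out with Python's built-in `ord` and `chr`. This is only an illustrative sketch; the helper names `to_notation` and `from_notation` are made up for this example.

```python
def to_notation(ch: str) -> str:
    """Return the code point of a single character as 'U+XXXX' (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


def from_notation(text: str) -> str:
    """Return the character named by a string such as 'U+7530'."""
    return chr(int(text[2:], 16))


print(to_notation("田"))         # U+7530
print(to_notation("A"))          # U+0041
print(from_notation("U+7530"))   # 田
```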

The character set defined by Unicode contains more than 2^16 characters, and all of its code points are divided into 17 code planes:

  • U+0000 ~ U+FFFF is the Basic Multilingual Plane (BMP);
  • The rest are divided into 16 supplementary planes, with the code point range U+10000 ~ U+10FFFF.

Despite this division, not every code point in a plane has a character assigned to it; some are reserved and some have special purposes.
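Since each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with the hypothetical helper name `plane_of`:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space U+0000..U+10FFFF")
    # Each plane covers 0x10000 code points, so the plane is the value above bit 15.
    return code_point >> 16


print(plane_of(0x7530))    # 0  (BMP)
print(plane_of(0x10000))   # 1  (first supplementary plane)
print(plane_of(0x10FFFF))  # 16 (last supplementary plane)
```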

II. Unicode encoding implementation

The Unicode encoding and its implementation are two different things. The Unicode code point of a character is fixed, but in actual storage and transmission, because different system platforms are not designed alike and because of the wish to save space, Unicode is implemented in different ways. An implementation of Unicode is called a Unicode Transformation Format (UTF).

The implementations of Unicode include UTF-16BE, UTF-16LE, UTF-8, UTF-7, UTF-32, and others. The ones in common use today are UTF-16LE, UTF-16BE, and UTF-8.

2.1 UTF-16

UTF-16 expresses Unicode in 16-bit units, so a single unit can express 2^16 (65,536) values; in other words, the UTF-16 code unit is 16 bits. A character in the BMP is expressed by one UTF-16 code unit, and UTF-16 has a clever design for characters in the supplementary planes.

In the BMP, the code points from U+D800 to U+DFFF are permanently reserved and never mapped to characters. UTF-16 uses the values in this reserved 0xD800..0xDFFF segment to encode the code points of supplementary-plane characters.

Encoding of U+0000..U+D7FF and U+E000..U+FFFF

Both UTF-16 and UCS-2 encode a code point in this range as a single 16-bit code unit whose value equals the code point. These BMP code points are the only ones that UCS-2 can represent.

Encoding of U+10000..U+10FFFF

A code point in the supplementary planes is encoded in UTF-16 as a pair of 16-bit code units (32 bits, 4 bytes), called a surrogate pair.

 

The specific method is:

  1. Subtract 0x10000 from the code point; the result is a 20-bit value (0..0xFFFFF);
  2. Add 0xD800 to the high 10 bits of the value from step 1 (range 0..0x3FF) to get the first code unit, called the high surrogate or lead surrogate. Its range is 0xD800..0xDBFF;
  3. Add 0xDC00 to the low 10 bits of the value from step 1 (range 0..0x3FF) to get the second code unit, called the low surrogate or trail surrogate. Its range is 0xDC00..0xDFFF.

In this way, a character in this range is encoded as a surrogate pair [lead surrogate, trail surrogate]: two 16-bit code units whose values fall in 0xD800..0xDBFF and 0xDC00..0xDFFF respectively. A code unit that encodes a BMP character falls in 0x0000..0xFFFF with 0xD800..0xDFFF excluded, because that segment is reserved. The three ranges therefore do not overlap, which makes decoding easy to implement.
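The three steps above can be sketched in a few lines of Python. The function name `utf16_code_units` is an assumption made for this illustration, and the sketch returns code unit values rather than bytes, so byte order does not come into it yet.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code unit(s) for one code point as a list of ints."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("the surrogate range is reserved and has no characters")
    if code_point < 0x10000:
        # BMP: a single code unit whose value equals the code point.
        return [code_point]
    v = code_point - 0x10000          # step 1: a 20-bit value
    lead = 0xD800 + (v >> 10)         # step 2: high 10 bits
    trail = 0xDC00 + (v & 0x3FF)      # step 3: low 10 bits
    return [lead, trail]


print([hex(u) for u in utf16_code_units(0x7530)])    # ['0x7530']
print([hex(u) for u in utf16_code_units(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in utf16_code_units(0x10FFFF)])  # ['0xdbff', '0xdfff']
```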

The following table shows how a code unit pair [high surrogate + low surrogate] corresponds to the code point obtained by decoding UTF-16.

Decoding UTF-16

    Hi \ Lo    DC00      DC01      ...    DFFF
    D800       10000     10001     ...    103FF
    D801       10400     10401     ...    107FF
    ...        ...       ...       ...    ...
    DBFF       10FC00    10FC01    ...    10FFFF
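Every cell of the table follows from reversing the encoding steps. A small sketch of that calculation, with the hypothetical function name `decode_surrogate_pair`:

```python
def decode_surrogate_pair(lead: int, trail: int) -> int:
    """Combine a lead (high) and trail (low) surrogate into one code point."""
    if not 0xD800 <= lead <= 0xDBFF:
        raise ValueError("lead surrogate must be in 0xD800..0xDBFF")
    if not 0xDC00 <= trail <= 0xDFFF:
        raise ValueError("trail surrogate must be in 0xDC00..0xDFFF")
    # The lead carries the high 10 bits, the trail the low 10 bits.
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


print(f"{decode_surrogate_pair(0xD800, 0xDC00):X}")  # 10000
print(f"{decode_surrogate_pair(0xD801, 0xDFFF):X}")  # 107FF
print(f"{decode_surrogate_pair(0xDBFF, 0xDFFF):X}")  # 10FFFF
```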


The following takes the UTF-16 encoding of U+64321 as an example of how a character in a supplementary plane is encoded:

 

    V  = 0x64321
    Vx = V - 0x10000
       = 0x54321
       = 0101 0100 0011 0010 0001

    Vh = 01 0101 0000    // high 10 bits of Vx
    Vl = 11 0010 0001    // low 10 bits of Vx

    W1 = 0xD800          // initial value of the first 16-bit code unit
    W2 = 0xDC00          // initial value of the second 16-bit code unit

    W1 = W1 | Vh
       = 1101 1000 0000 0000 | 01 0101 0000
       = 1101 1001 0101 0000
       = 0xD950

    W2 = W2 | Vl
       = 1101 1100 0000 0000 | 11 0010 0001
       = 1101 1111 0010 0001
       = 0xDF21

 

So the final UTF-16 encoding of the character U+64321 is:

0xD950 0xDF21
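This result can be cross-checked against Python's built-in `utf-16-be` codec, which emits the code units in big-endian byte order without a BOM. The variable names are chosen only for this example.

```python
# Encode U+64321 with the standard codec and split the bytes into 16-bit units.
encoded = chr(0x64321).encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)]

print([f"0x{u:04X}" for u in units])  # ['0xD950', '0xDF21']
```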

The UTF-16 code unit is 16 bits, that is, two bytes. Storing a code unit therefore raises the question of byte order, the endian problem, which is described in section III.

 

2.2 UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that uses one to four bytes for each character encoding:

  • Code points in the range U+0000..U+007F (the ASCII characters) need only one byte;
  • Code points in the range U+0080..U+07FF are encoded in two bytes;
  • The other BMP characters, in the range U+0800..U+FFFF (which contains most commonly used characters), are encoded in three bytes;
  • Characters in the supplementary planes (other, rarely used characters) are encoded in four bytes.
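The four ranges translate directly into a length function. The sketch below (the name `utf8_length` is hypothetical) compares its answer with the length Python's own UTF-8 codec produces for a few sample code points.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one code point, per the ranges above."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


for cp in (0x41, 0xE9, 0x7530, 0x64321):
    # Both numbers on each line should agree.
    print(f"U+{cp:04X}", utf8_length(cp), len(chr(cp).encode("utf-8")))
```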

For the fourth category above, four bytes in UTF-8 may look wasteful. However, UTF-8 needs only three bytes for every commonly used character, UTF-16 also needs four bytes for the fourth category, and for ASCII characters UTF-8 saves a great deal of storage space. UTF-8 has gradually become the preferred encoding for e-mail, web pages, and other applications that store or transmit text. The Internet Engineering Task Force (IETF) requires all Internet protocols to support UTF-8. The Internet Mail Consortium (IMC) recommends that all e-mail software support UTF-8.

The UTF-8 encoding rules for code points in each range are as follows:

    Code point range            UTF-8 byte stream
    U+00000000 - U+0000007F     0xxxxxxx
    U+00000080 - U+000007FF     110xxxxx 10xxxxxx
    U+00000800 - U+0000FFFF     1110xxxx 10xxxxxx 10xxxxxx
    U+00010000 - U+001FFFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The four-byte pattern has room for values up to U+1FFFFF, but Unicode code points end at U+10FFFF.

The range U+D800..U+DFFF has no characters assigned in the Unicode character set; UTF-16 uses it to encode supplementary-plane characters.

The following uses "田" (code point U+7530) as an example of how to encode a character in UTF-8:

  • U+7530 falls in the range U+0800..U+FFFF, so it is encoded in three bytes;
  • Convert 0x7530 to binary and group the bits as 4 + 6 + 6: 0111 010100 110000;
  • Fill the bits into the template 1110xxxx 10xxxxxx 10xxxxxx to get 11100111 10010100 10110000.

This gives the UTF-8 encoding of "田" (U+7530): 0xE7 0x94 0xB0.
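The same bit-filling can be written as a small encoder. This is a sketch for illustration rather than a replacement for a real codec; the function name `utf8_encode` is an assumption, and the result is compared with Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling its bits into the UTF-8 templates."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x7530).hex())                   # e794b0
print(utf8_encode(0x7530) == "田".encode("utf-8"))  # True
```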

Knowing the UTF-8 encoding rules, we can classify any byte B in a UTF-8 byte stream as follows:

  • If the first bit of B is 0, B is an ASCII code and represents a character on its own;
  • If the first bit of B is 1 and the second bit is 0, B is one byte of a non-ASCII (multi-byte) character, but not its first byte; it is a trailing byte;
  • If the first two bits of B are 1 and the third bit is 0, B is the first byte of a non-ASCII character that is represented by two bytes;
  • If the first three bits of B are 1 and the fourth bit is 0, B is the first byte of a non-ASCII character that is represented by three bytes;
  • If the first four bits of B are 1 and the fifth bit is 0, B is the first byte of a non-ASCII character that is represented by four bytes.
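These rules amount to testing the leading bits of B with masks. A minimal sketch follows; the function name and the labels it returns are made up for this example.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of a UTF-8 stream by its leading bits."""
    if (b & 0x80) == 0x00:   # 0xxxxxxx
        return "single"
    if (b & 0xC0) == 0x80:   # 10xxxxxx
        return "continuation"
    if (b & 0xE0) == 0xC0:   # 110xxxxx
        return "lead-2"
    if (b & 0xF0) == 0xE0:   # 1110xxxx
        return "lead-3"
    if (b & 0xF8) == 0xF0:   # 11110xxx
        return "lead-4"
    return "invalid"         # 11111xxx does not occur in UTF-8


print([classify_utf8_byte(b) for b in "A田".encode("utf-8")])
# ['single', 'lead-3', 'continuation', 'continuation']
```

Because a trailing byte can always be told apart from a first byte, a decoder that starts in the middle of a stream can skip forward to the next character boundary.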
2.3 UCS-2 vs UTF-16, UCS-4 vs UTF-32

In UCS-2, each character occupies 2 bytes. UCS-2 is a subset of UTF-16. Before the supplementary planes existed, UTF-16 and UCS-2 meant the same thing. When supplementary-plane characters were introduced, UTF-16 added support for them. If software now claims to support UCS-2, that actually means it does not support the characters that take more than 2 bytes in UTF-16. For code points below 0x10000, the UTF-16 encoding equals the UCS-2 encoding. Early versions of Java supported only UCS-2; full UTF-16 support was added later.
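The difference shows up as soon as a string contains a supplementary-plane character. The two helper names below are hypothetical, and the sketch assumes its input contains no lone surrogates.

```python
def fits_in_ucs2(text: str) -> bool:
    """True if every character is in the BMP, so UCS-2 can represent the text."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def utf16_unit_count(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_text = "田"
supplementary_text = chr(0x64321)

print(fits_in_ucs2(bmp_text), utf16_unit_count(bmp_text))                      # True 1
print(fits_in_ucs2(supplementary_text), utf16_unit_count(supplementary_text))  # False 2
```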

 

UCS-4 and UTF-32 mean the same thing. Each character uses 4 bytes (31 bits hold the character value, plus a leading bit that is always 0, for 32 bits in total). In theory this can hold up to 2^31 characters, enough to cover the symbols of every language. A fixed length per code point is convenient to work with, but it wastes a great deal of space when common characters need only 2 bytes, so it is seldom used for storage or transmission.
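The space trade-off is easy to see by encoding a few characters with Python's standard codecs. The helper name `encoded_sizes` is an assumption; the `-le` codec variants are used so that no BOM is counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("A"))            # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("田"))            # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes(chr(0x64321)))   # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```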

III. Big-endian/little-endian and BOM

As mentioned for UTF-16, its code unit is 2 bytes. During transmission and storage, if the high and low bytes are put in a different order, the two bytes mean a different character. For example, the UTF-16 code unit of "田" is 0x7530, but if it is read as 0x3075 it becomes "ふ" (U+3075), a different character.
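The effect of byte order can be reproduced with `int.to_bytes` and the explicit-endian UTF-16 codecs. The variable names are chosen only for this example.

```python
code_unit = 0x7530  # the UTF-16 code unit of 田

big = code_unit.to_bytes(2, "big")        # bytes 75 30
little = code_unit.to_bytes(2, "little")  # bytes 30 75

print(big.hex(), little.hex())            # 7530 3075

# Reading each byte sequence with the matching byte order gives back 田.
print(big.decode("utf-16-be"), little.decode("utf-16-le"))

# Reading the little-endian bytes as big-endian yields U+3075 instead.
wrong = little.decode("utf-16-be")
print(f"U+{ord(wrong):04X}")              # U+3075
```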

Therefore, a special character is needed to identify the storage order of the encoded text. The Unicode character U+FEFF is used to indicate the storage order; it is called the byte order mark (BOM).

  • Big-endian: the high-order byte is stored at the lower address, which can be called high-byte-first. Memory is filled starting from the lowest address, and the high-order byte is written first, so the most significant byte comes first.
  • Little-endian: the low-order byte is stored at the lower address, which can be called low-byte-first. Memory is filled starting from the lowest address, and the low-order byte is written first, so the least significant byte comes first.

The BOM is stored as FE FF on a big-endian system and as FF FE on a little-endian system. So a UTF-16 file stored big-endian (UTF-16BE) starts with FEFF, and a UTF-16 file stored little-endian (UTF-16LE) starts with FFFE.

The UTF-8 encoding of the BOM is 11101111 10111011 10111111 (EF BB BF), so EF BB BF is commonly placed at the start of a text to mark it as UTF-8.
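A reader can use these leading bytes to guess the encoding of a file. The sketch below relies on the BOM constants in Python's standard `codecs` module; the function name `sniff_bom` is hypothetical, and it covers only the three encodings discussed in this article (UTF-32, whose little-endian BOM also begins with FF FE, is not handled).

```python
import codecs


def sniff_bom(data: bytes):
    """Guess an encoding from a leading BOM; return None if there is none."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    return None


print(sniff_bom(bytes.fromhex("feff7530")))    # utf-16-be
print(sniff_bom(bytes.fromhex("fffe3075")))    # utf-16-le
print(sniff_bom(bytes.fromhex("efbbbfe794b0")))  # utf-8
print(sniff_bom(b"plain ASCII"))               # None
```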

IV. Unicode encoding in practice


In Notepad, the Windows text editor, "Save As" offers several encoding options: "ANSI", "Unicode", "Unicode big endian", and "UTF-8". Because the Windows platform stores data little-endian, "Unicode" and "Unicode big endian" correspond to UTF-16LE and UTF-16BE respectively.

Readers can type a few characters, save them in the different encodings, and then open the files in a hex editor (for example, UltraEdit) to inspect the encoding and the storage order. The following are the results of saving "田海立" (U+7530, U+6D77, U+7ACB) in the different encodings:

    Saved as                         File contents (hex)
    Unicode big endian (UTF-16BE)    FE FF 75 30 6D 77 7A CB
    Unicode (UTF-16LE)               FF FE 30 75 77 6D CB 7A
    UTF-8                            EF BB BF E7 94 B0 E6 B5 B7 E7 AB 8B

In each result, the leading bytes are the BOM, followed by the encodings of 田, 海, and 立 in that order. As you can see, a Unicode-encoded file saved by Notepad starts with the BOM of the corresponding encoding.
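The same byte sequences can be reproduced without Notepad. This sketch uses Python's standard codecs; the BOM is prepended by hand for the two UTF-16 variants because the explicit-endian codecs do not write one, while `utf-8-sig` writes the UTF-8 BOM itself.

```python
import codecs

name = "\u7530\u6d77\u7acb"  # 田海立

dumps = {
    "utf-16-be": (codecs.BOM_UTF16_BE + name.encode("utf-16-be")).hex(),
    "utf-16-le": (codecs.BOM_UTF16_LE + name.encode("utf-16-le")).hex(),
    "utf-8": name.encode("utf-8-sig").hex(),
}

for encoding, hex_dump in dumps.items():
    print(encoding, hex_dump)
# utf-16-be feff75306d777acb
# utf-16-le fffe3075776dcb7a
# utf-8 efbbbfe794b0e6b5b7e7ab8b
```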

[Postscript] History

I recently needed the details of how Unicode is implemented and collected some material. I then found that I had planned much earlier to write up the Unicode encoding implementations, and that the document already had an outline. I no longer remember why it was put off. Fortunately it is now written up. The palest ink is better than the best memory: had I written it up at the start, I would not have had to spend time collecting the material again.

I hope this summary is complete enough that the next time Unicode encoding comes up, referring to this article will be all that is needed! (Provided, of course, that the Unicode Standard does not evolve further ^_^)

[Appendix] Basic terms

  1. Code point: the value assigned to a character in the Unicode character set, also called a code position.
  2. Code unit: the smallest bit combination that represents a unit of encoded text. For UTF-8 the code unit is 8 bits long; for UTF-16 it is 16 bits long; for UTF-32 it is 32 bits long.
  3. BMP: Basic Multilingual Plane
  4. UTF: Unicode Transformation Format
  5. BOM: Byte Order Mark
  6. UCS: Universal Character Set

 

Reprinted from:

Http://blog.csdn.net/thl789/article/details/7506133
