Unicode detailed analysis and explanation

Source: Internet
Author: User

This article may be reproduced freely, but the original author charlee and the original link http://tech.idv2.com/2008/02/21/unicode-intro/ must be indicated when reprinting.

    • Basic knowledge

      • Differences between byte and character
      • Big endian and little endian
    • UCS-2 and UCS-4
    • UTF-16 and UTF-32
      • UTF-16
      • UTF-32
    • UTF-8

Basic knowledge

Before introducing Unicode, we should first explain some basic knowledge. Although it is not directly related to Unicode, Unicode cannot be understood without it.

Differences between byte and character

What is the difference between a byte and a character? Are they the same thing? That was true, but only in the old DOS era. Once Unicode appeared, bytes and characters were no longer the same.

A byte (octet) is an eight-bit storage unit, so its value always lies in the range 0 ~ 255. A character is a symbol in the linguistic sense, and its range is not fixed. For example, the character range defined in UCS-2 is 0 ~ 65535, so one character occupies two bytes.
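
The distinction can be made concrete with a short Python sketch. It assumes Python 3, where `str` holds characters and `bytes` holds bytes; the sample string and the choice of the `utf-16-be` codec are illustrative and not part of the original article.

```python
# A str counts characters; a bytes object counts bytes.
text = "A中"  # two characters (illustrative sample)

# Both characters lie in the UCS-2 range, so each one takes two bytes here.
encoded = text.encode("utf-16-be")

char_count = len(text)     # number of characters
byte_count = len(encoded)  # number of bytes

print(char_count, byte_count)  # 2 4
print(list(encoded))           # every value is within 0 ~ 255
```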

Big endian and little endian

As mentioned above, a single character may occupy multiple bytes. How are these bytes stored on a computer? For example, if the character is 0xABCD, is it stored as AB CD or as CD AB?

In fact, both are possible and have different names. If it is stored as AB CD, it is called big endian. If it is stored as CD AB, it is called little endian.

Specifically, the following storage format is big endian, because the high byte (0xAB) of the value (0xABCD) is stored first:

Address     Value
0x00000000  AB
0x00000001  CD

Conversely, the following storage format is little endian:

Address     Value
0x00000000  CD
0x00000001  AB
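
As a minimal sketch of the two layouts, the following Python snippet serializes the value 0xABCD both ways using the built-in `int.to_bytes`. The variable names are illustrative only.

```python
value = 0xABCD

# Big endian: the high byte (0xAB) comes first.
big_endian = value.to_bytes(2, byteorder="big")

# Little endian: the low byte (0xCD) comes first.
little_endian = value.to_bytes(2, byteorder="little")

print(big_endian.hex().upper())     # ABCD
print(little_endian.hex().upper())  # CDAB

# Reading the bytes back with the matching byte order restores the value.
restored = int.from_bytes(little_endian, byteorder="little")
print(hex(restored))                # 0xabcd
```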

UCS-2 and UCS-4

Unicode was born to unify all the languages of the world. Every character corresponds to a value in Unicode. This value is called a code point. The value of a code point is generally written in the U+ABCD format. The correspondence between text and code points is UCS-2 (Universal Character Set coded in 2 octets). As the name suggests, UCS-2 uses two bytes to represent a code point, and its value range is U+0000 ~ U+FFFF.
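
To illustrate the idea of a code point, here is a small Python sketch that prints the code point of a character in the usual U+XXXX notation. The helper name `code_point_label` is hypothetical; it relies only on the built-in `ord`.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    # ord() gives the numeric code point; pad to at least four hex digits.
    return f"U+{ord(ch):04X}"

print(code_point_label("A"))   # U+0041
print(code_point_label("中"))  # U+4E2D
```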

In order to represent more characters, UCS-4 was proposed, which uses four bytes to represent a code point. Its range is U+00000000 ~ U+7FFFFFFF, where U+00000000 ~ U+0000FFFF is the same as UCS-2.

Note that UCS-2 and UCS-4 only specify the correspondence between code points and text, and do not specify how code points are stored in the computer. The storage method is called UTF (Unicode Transformation Format), of which the most widely used are UTF-16 and UTF-8.

UTF-16 and UTF-32

UTF-16

UTF-16 is defined by RFC 2781 and uses two bytes to represent a code point.

It is not hard to guess that UTF-16 corresponds exactly to UCS-2: the code points specified by UCS-2 are stored directly, in either big endian or little endian order. UTF-16 comes in three variants: UTF-16, UTF-16BE (big endian), and UTF-16LE (little endian).

UTF-16BE and UTF-16LE are easy to understand. Plain UTF-16, however, needs to indicate whether the file is big endian or little endian by starting with a character called the BOM (Byte Order Mark). The BOM is the character U+FEFF.

The BOM is actually a clever idea. Since UCS-2 does not define U+FFFE, whenever the byte sequence FF FE or FE FF appears at the start, it can be taken to be U+FEFF, and the order of the two bytes tells you whether the data is big endian or little endian.
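
The BOM check described above could be sketched in Python as follows. The function name `utf16_byte_order` and its return values are assumptions made for illustration; it only inspects the first two bytes.

```python
from typing import Optional

def utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the byte order of UTF-16 data from a leading BOM (U+FEFF)."""
    if data[:2] == b"\xFE\xFF":
        return "big"      # U+FEFF stored high byte first
    if data[:2] == b"\xFF\xFE":
        return "little"   # U+FEFF stored low byte first
    return None           # no BOM present

print(utf16_byte_order(bytes.fromhex("FEFF004100420043")))  # big
print(utf16_byte_order(bytes.fromhex("FFFE410042004300")))  # little
print(utf16_byte_order(bytes.fromhex("004100420043")))      # None
```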

For example, the three characters "ABC" are encoded in the various ways as follows:

UTF-16BE                00 41 00 42 00 43
UTF-16LE                41 00 42 00 43 00
UTF-16 (big endian)     FE FF 00 41 00 42 00 43
UTF-16 (little endian)  FF FE 41 00 42 00 43 00
UTF-16 (without BOM)    00 41 00 42 00 43

The default Unicode encoding on Windows is little endian UTF-16 (that is, the FF FE 41 00 42 00 43 00 above). You can open Notepad, write ABC, save it, and view the encoding result with a binary editor.
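
The same byte sequences can be reproduced with Python's standard codecs, as a rough check. In this sketch the BOM is prepended explicitly with `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE`, because Python's plain `utf-16` codec picks the byte order of the machine it runs on. The helper `hex_bytes` is hypothetical.

```python
import codecs

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

text = "ABC"
utf16_be = text.encode("utf-16-be")              # no BOM, big endian
utf16_le = text.encode("utf-16-le")              # no BOM, little endian
utf16_bom_be = codecs.BOM_UTF16_BE + utf16_be    # BOM + big endian
utf16_bom_le = codecs.BOM_UTF16_LE + utf16_le    # BOM + little endian

print(hex_bytes(utf16_be))      # 00 41 00 42 00 43
print(hex_bytes(utf16_le))      # 41 00 42 00 43 00
print(hex_bytes(utf16_bom_be))  # FE FF 00 41 00 42 00 43
print(hex_bytes(utf16_bom_le))  # FF FE 41 00 42 00 43 00
```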

In addition, UTF-16 can also represent part of the UCS-4 code points, namely U+10000 ~ U+10FFFF. The algorithm is somewhat complex; briefly, it works as follows:

    1. Subtract 0x10000 from the code point U to get U'. This maps U+10000 ~ U+10FFFF to 0x00000 ~ 0xFFFFF.
    2. Express U' in 20 bits: U' = yyyyyyyyyyxxxxxxxxxx
    3. Put the first 10 bits and the last 10 bits of U' into W1 and W2: W1 = 110110yyyyyyyyyy and W2 = 110111xxxxxxxxxx. Then W1 = D800 ~ DBFF and W2 = DC00 ~ DFFF.

For example, U+12345 is encoded as D8 08 DF 45 (UTF-16BE), or 08 D8 45 DF (UTF-16LE).

However, because of this algorithm, the range U+D800 ~ U+DFFF is left without any defined characters.
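
The three steps above can be sketched in Python as follows. The function name `to_surrogate_pair` is hypothetical, and the range check is an added assumption; the arithmetic follows the steps as listed.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000 ~ U+10FFFF into W1 and W2."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+10000 ~ U+10FFFF")
    u_prime = code_point - 0x10000        # step 1: now a 20-bit value
    w1 = 0xD800 | (u_prime >> 10)         # step 3: 110110 + first 10 bits
    w2 = 0xDC00 | (u_prime & 0x3FF)       # step 3: 110111 + last 10 bits
    return w1, w2

w1, w2 = to_surrogate_pair(0x12345)
print(f"{w1:04X} {w2:04X}")  # D808 DF45

# Cross-check against Python's own UTF-16BE encoder.
print(chr(0x12345).encode("utf-16-be").hex().upper())  # D808DF45
```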

UTF-32

UTF-32 represents each code point in four bytes, so all UCS-4 code points can be expressed directly, without a complex algorithm like the one UTF-16 needs. Like UTF-16, UTF-32 comes in three variants: UTF-32, UTF-32BE, and UTF-32LE, and plain UTF-32 likewise needs a BOM. Taking "ABC" as the example again:

UTF-32BE                00 00 00 41 00 00 00 42 00 00 00 43
UTF-32LE                41 00 00 00 42 00 00 00 43 00 00 00
UTF-32 (big endian)     00 00 FE FF 00 00 00 41 00 00 00 42 00 00 00 43
UTF-32 (little endian)  FF FE 00 00 41 00 00 00 42 00 00 00 43 00 00 00
UTF-32 (without BOM)    00 00 00 41 00 00 00 42 00 00 00 43
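
Because UTF-32 stores the code point itself, a minimal encoder is little more than a fixed-width integer write. The sketch below, with the hypothetical helper `utf32_encode`, builds the bytes by hand and compares them with Python's `utf-32-be` and `utf-32-le` codecs.

```python
def utf32_encode(text: str, byteorder: str = "big") -> bytes:
    """Encode text as UTF-32 without a BOM: four bytes per code point."""
    return b"".join(ord(ch).to_bytes(4, byteorder=byteorder) for ch in text)

manual_be = utf32_encode("ABC", "big")
manual_le = utf32_encode("ABC", "little")

print(manual_be.hex().upper())  # 000000410000004200000043
print(manual_le.hex().upper())  # 410000004200000043000000

# The hand-built bytes match the standard codecs.
print(manual_be == "ABC".encode("utf-32-be"))  # True
print(manual_le == "ABC".encode("utf-32-le"))  # True
```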

UTF-8

One drawback of UTF-16 and UTF-32 is that they always use two or four bytes per character, so a plain ASCII file ends up containing many 00 bytes, which is wasteful. UTF-8, defined by RFC 3629, addresses this problem.

UTF-8 uses 1 ~ 4 bytes to represent a code point. The format is as follows:

UCS-2 (UCS-4)        Bit sequence                         First byte  Second byte  Third byte  Fourth byte
U+0000 .. U+007F     00000000-0xxxxxxx                    0xxxxxxx
U+0080 .. U+07FF     00000xxx-xxyyyyyy                    110xxxxx    10yyyyyy
U+0800 .. U+FFFF     xxxxyyyy-yyzzzzzz                    1110xxxx    10yyyyyy     10zzzzzz
U+10000 .. U+1FFFFF  00000000-000wwwxx-xxxxyyyy-yyzzzzzz  11110www    10xxxxxx     10yyyyyy    10zzzzzz
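
The table can be turned into a small encoder, shown here as a Python sketch. The function name `utf8_encode` is hypothetical. As an assumption beyond the table, it stops at U+10FFFF, the upper limit set by RFC 3629, and it does not reject the surrogate range U+D800 ~ U+DFFF.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the bit layout in the table."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:    # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point above U+10FFFF")

print(utf8_encode(0x41).hex().upper())     # 41
print(utf8_encode(0x4E2D).hex().upper())   # E4B8AD
print(utf8_encode(0x12345).hex().upper())  # F0928D85
```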

As can be seen, ASCII characters (U+0000 ~ U+007F) use only one byte, which avoids wasting storage space. In addition, UTF-8 no longer requires a BOM.

We can also see from the table that the first byte of a single-byte encoding is in [00-7F], the first byte of a two-byte encoding is in [C2-DF], and the first byte of a three-byte encoding is in [E0-EF]. So the number of bytes in an encoding can be determined from the range of its first byte alone. This greatly simplifies the algorithm.
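
A decoder can use this property to decide how many bytes to read, as in the following Python sketch. The function name `utf8_sequence_length` is hypothetical. The ranges for one to three bytes come from the text; the four-byte range [F0-F4] is an assumption taken from RFC 3629, which limits code points to U+10FFFF.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the length of a UTF-8 sequence given its first byte."""
    if 0x00 <= first_byte <= 0x7F:
        return 1
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:  # assumption: range per RFC 3629
        return 4
    # 80-BF are continuation bytes; C0, C1 and F5-FF never start a sequence.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")

for ch in ["A", "é", "中", chr(0x12345)]:
    first = ch.encode("utf-8")[0]
    print(f"{first:02X} -> {utf8_sequence_length(first)} byte(s)")
```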
