C++ implementation of Unicode code point to UTF-16 encoding and decoding functions

Source: Internet
Author: User

A Unicode code point is not the same thing as its encoded form. The Unicode code point of a character is fixed, but in actual storage and transmission, because different system platforms are not designed consistently and in order to save space, Unicode is implemented in several different ways. These implementations are called Unicode Transformation Formats (UTF).

The Unicode encoding forms include UTF-16BE, UTF-16LE, UTF-8, UTF-7, and UTF-32, among others. The ones in common use today are UTF-16LE, UTF-16BE, and UTF-8.
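UTF-16LE and UTF-16BE differ only in the byte order used to store each 16-bit code unit. A minimal sketch of that difference follows; the helper names to_utf16le and to_utf16be are hypothetical and are not part of the original article.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helpers (not from the article): split one 16-bit code unit
// into its two bytes in little-endian or big-endian order.
std::array<std::uint8_t, 2> to_utf16le(char16_t unit) {
    return { static_cast<std::uint8_t>(unit & 0xFF),   // low byte first
             static_cast<std::uint8_t>(unit >> 8) };
}

std::array<std::uint8_t, 2> to_utf16be(char16_t unit) {
    return { static_cast<std::uint8_t>(unit >> 8),     // high byte first
             static_cast<std::uint8_t>(unit & 0xFF) };
}
```

For the code unit 0x4E2D, the little-endian form is the byte sequence 2D 4E and the big-endian form is 4E 2D.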

UTF-16

UTF-16 uses 16-bit code units, so a single code unit can represent 2^16 (65536) values. A character in the BMP (Basic Multilingual Plane) is expressed with one UTF-16 code unit; for characters in the supplementary planes, UTF-16 uses a clever design.

In the BMP, the code points from U+D800 to U+DFFF are permanently reserved and are never mapped to characters. UTF-16 uses code units in this reserved 0xD800..0xDFFF range to encode the code points of supplementary-plane characters.

 

Encoding of U+0000..U+D7FF and U+E000..U+FFFF

UTF-16 and UCS-2 both encode a code point in this range as a single 16-bit code unit whose value equals the code point. These BMP code points are the only ones that UCS-2 can represent.

Encoding of U+10000..U+10FFFF

A code point in the supplementary planes is encoded as a pair of 16-bit code units (32 bits, 4 bytes), called a surrogate pair.

 

 

The specific method is:

  1. Subtract 0x10000 from the code point. The result is a 20-bit value (0..0xFFFFF).
  2. Add the high 10 bits of the value from step 1 (range 0..0x3FF) to 0xD800 to get the first code unit, called the high surrogate or lead surrogate. Its range is 0xD800..0xDBFF.
  3. Add the low 10 bits of the value from step 1 (range 0..0x3FF) to 0xDC00 to get the second code unit, called the low surrogate or trail surrogate. Its range is 0xDC00..0xDFFF.
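The three steps can be sketched with standard C++ types as follows. This is only an illustration under stated assumptions: the function name to_surrogate_pair is hypothetical, and the caller is assumed to pass a code point in U+10000..U+10FFFF.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper (not from the article): returns {lead, trail}
// for a supplementary-plane code point.
std::pair<char16_t, char16_t> to_surrogate_pair(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);  // assumption: not a BMP code point
    const std::uint32_t v = cp - 0x10000;                               // step 1: 20-bit value
    const char16_t lead  = static_cast<char16_t>(0xD800 + (v >> 10));   // step 2: high 10 bits
    const char16_t trail = static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // step 3: low 10 bits
    return {lead, trail};
}
```

For example, to_surrogate_pair(0x10000) yields the pair 0xD800, 0xDC00, the smallest surrogate pair.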

In this way, a character in this range is encoded as a surrogate pair [lead surrogate, trail surrogate]: two 16-bit code units with value ranges 0xD800..0xDBFF and 0xDC00..0xDFFF respectively. A code unit that encodes a BMP character lies in 0x0000..0xFFFF, excluding the reserved range 0xD800..0xDFFF. These three ranges do not overlap, so a decoder can tell from any single code unit which kind it is, which makes decoding easy to implement.
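Because the three ranges are disjoint, classifying a code unit takes only two range checks. A small sketch, where the names UnitKind and classify_unit are hypothetical:

```cpp
// Hypothetical names (not from the article).
enum class UnitKind { Bmp, LeadSurrogate, TrailSurrogate };

// Classifies a single UTF-16 code unit by the range it falls in.
UnitKind classify_unit(char16_t u) {
    if (u >= 0xD800 && u <= 0xDBFF) return UnitKind::LeadSurrogate;
    if (u >= 0xDC00 && u <= 0xDFFF) return UnitKind::TrailSurrogate;
    return UnitKind::Bmp;  // 0x0000..0xD7FF or 0xE000..0xFFFF
}
```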

 

Decoding UTF-16

Hi \ Lo   DC00     DC01     ...   DFFF
D800      10000    10001    ...   103FF
D801      10400    10401    ...   107FF
...       ...      ...      ...   ...
DBFF      10FC00   10FC01   ...   10FFFF
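The mapping from a (high, low) surrogate pair back to a code point can be computed directly instead of looked up. A minimal sketch, assuming both inputs are already known to be valid surrogates; the function name from_surrogate_pair is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the article): combines a lead and a trail
// surrogate into the code point they encode.
std::uint32_t from_surrogate_pair(char16_t lead, char16_t trail) {
    assert(lead >= 0xD800 && lead <= 0xDBFF);    // assumption: valid lead surrogate
    assert(trail >= 0xDC00 && trail <= 0xDFFF);  // assumption: valid trail surrogate
    return 0x10000u
         + ((static_cast<std::uint32_t>(lead) - 0xD800u) << 10)  // high 10 bits
         + (static_cast<std::uint32_t>(trail) - 0xDC00u);        // low 10 bits
}
```

The corner values agree with the decoding table: (D800, DC00) gives 10000 and (DBFF, DFFF) gives 10FFFF.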

 

 

 

 

 

 

 

The following takes the UTF-16 encoding of U+64321 as an example to show how a supplementary-plane character is encoded:

V  = 0x64321
Vx = V - 0x10000 = 0x54321 = 0101 0100 0011 0010 0001
Vh = 01 0101 0000    // high 10 bits of Vx
Vl = 11 0010 0001    // low 10 bits of Vx
W1 = 0xD800          // initial value of the first 16-bit code unit of the result
W2 = 0xDC00          // initial value of the second 16-bit code unit of the result
W1 = W1 | Vh = 1101 1000 0000 0000 | 01 0101 0000 = 1101 1001 0101 0000 = 0xD950
W2 = W2 | Vl = 1101 1100 0000 0000 | 11 0010 0001 = 1101 1111 0010 0001 = 0xDF21

So the final UTF-16 encoding of the character U+64321 is:

0xD950 0xDF21

The following are C++ encoding and decoding functions written on this basis:

#include <windows.h>  // DWORD, WORD, WCHAR

// Encodes a Unicode code point outside the BMP as a UTF-16 surrogate pair.
// DWORD v:  Unicode code point of the character, greater than 0xFFFF
// WORD& w1: receives the high (lead) surrogate
// WORD& w2: receives the low (trail) surrogate
void encode(DWORD v, WORD& w1, WORD& w2)
{
    DWORD vx = v - 0x10000;
    WORD vh = (WORD)((vx & 0xFFC00) >> 10);  // high 10 bits
    WORD vl = (WORD)(vx & 0x03FF);           // low 10 bits
    w1 = 0xD800 | vh;
    w2 = 0xDC00 | vl;
}

// Decodes a UTF-16 surrogate pair into a Unicode code point.
// DWORD& w: receives the decoded code point, greater than 0xFFFF
// WORD w1:  high (lead) surrogate
// WORD w2:  low (trail) surrogate
void decode(DWORD& w, WORD w1, WORD w2)
{
    w = ((DWORD)(w1 & 0x3FF) << 10) | (w2 & 0x3FF);
    w += 0x10000;
}

Example:

(1) Encoding

WCHAR str[3];                        // holds one character plus the terminator
memset(str, 0, 3 * sizeof(WCHAR));
DWORD dw = 0x64321;                  // the code point of this character is outside the BMP
WORD w1, w2;
encode(dw, w1, w2);
str[0] = w1;
str[1] = w2;
str[2] = 0;

(2) Decoding

DWORD dw = 0;
WORD w1 = str[0];
WORD w2 = str[1];
// str holds a single character: if w2 == 0 the character is in the BMP
// and w1 is already its code point.
if (w2 != 0)
    decode(dw, w1, w2);              // dw is the Unicode code point for this UTF-16 sequence
else
    dw = w1;
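When a buffer holds more than one character, testing the second code unit against zero is not enough; each unit has to be checked against the surrogate ranges. The following is a sketch of such a loop using standard types. The function name decode_utf16 is hypothetical, and replacing an unpaired surrogate with U+FFFD is an assumption of this sketch, not something the article specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from the article): decodes a UTF-16 string
// into a list of Unicode code points.
std::vector<std::uint32_t> decode_utf16(const std::u16string& s) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const std::uint32_t u = s[i];
        const bool is_lead = (u >= 0xD800 && u <= 0xDBFF);
        if (is_lead && i + 1 < s.size()) {
            const std::uint32_t t = s[i + 1];
            if (t >= 0xDC00 && t <= 0xDFFF) {
                // Valid pair: combine the two 10-bit halves and add 0x10000.
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (t - 0xDC00));
                ++i;  // the trail surrogate has been consumed
                continue;
            }
        }
        // BMP code unit, or an unpaired surrogate, which this sketch
        // replaces with U+FFFD (assumption).
        out.push_back((u >= 0xD800 && u <= 0xDFFF) ? 0xFFFD : u);
    }
    return out;
}
```

Given the code units 0x0041, 0xD950, 0xDF21, this returns the two code points 0x41 and 0x64321.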

References:

Http://blog.csdn.net/thl789/article/details/7506133
