C++ Functions for Encoding Unicode Code Points as UTF-16 and Decoding Them

Last Update: 2018-12-03 Source: Internet

Author: User

The Unicode character set is distinct from the way it is encoded. The Unicode code point of a character is fixed, but in actual storage and transmission, because different system platforms are not necessarily designed the same way, and in order to save space, Unicode is implemented in several different ways. These implementations are called Unicode Transformation Formats (UTF).

Unicode encoding implementations mainly include UTF-16BE, UTF-16LE, UTF-8, UTF-7, and UTF-32. The ones in common use today are UTF-16LE, UTF-16BE, and UTF-8.
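UTF-16BE and UTF-16LE differ only in the byte order used to store each 16-bit code unit. The following is a minimal sketch of that difference; the helper names to_utf16be and to_utf16le are hypothetical and do not come from the original article.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: serialize one 16-bit code unit as UTF-16BE,
// which stores the most significant byte first.
std::array<std::uint8_t, 2> to_utf16be(char16_t unit) {
    return { static_cast<std::uint8_t>(unit >> 8),
             static_cast<std::uint8_t>(unit & 0xFF) };
}

// Hypothetical helper: serialize one 16-bit code unit as UTF-16LE,
// which stores the least significant byte first.
std::array<std::uint8_t, 2> to_utf16le(char16_t unit) {
    return { static_cast<std::uint8_t>(unit & 0xFF),
             static_cast<std::uint8_t>(unit >> 8) };
}
```

For example, the code unit 0x4E2D is stored as the bytes 4E 2D in UTF-16BE and as 2D 4E in UTF-16LE.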

UTF-16

UTF-16 expresses Unicode with 16-bit units, so a single unit can express 2^16 (65536) values; that is, the UTF-16 code unit is 16 bits. A character in the Basic Multilingual Plane (BMP) can be expressed with one UTF-16 code unit. For characters in the supplementary planes, UTF-16 has a clever design.

In the BMP, the code points from U+D800 to U+DFFF are permanently reserved and are not mapped to characters. UTF-16 uses this reserved 0xD800-0xDFFF segment to encode the code points of characters in the supplementary planes.

Encoding of U+0000..U+D7FF and U+E000..U+FFFF

Both UTF-16 and UCS-2 encode a code point in this range as a single 16-bit code unit whose value equals the code point. These BMP code points are the only code points that UCS-2 can represent.
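As a small illustration of this rule, the sketch below checks whether a code point falls in the two BMP ranges and, if so, encodes it as one code unit. The names is_bmp_scalar and encode_bmp are hypothetical, chosen only for this example.

```cpp
// Hypothetical helper: true when cp lies in U+0000..U+D7FF or U+E000..U+FFFF,
// that is, in the BMP but outside the reserved surrogate segment.
constexpr bool is_bmp_scalar(char32_t cp) {
    return cp <= 0xD7FF || (cp >= 0xE000 && cp <= 0xFFFF);
}

// Hypothetical helper: for such a code point, the single UTF-16 code unit
// has the same numeric value as the code point itself.
// The caller is assumed to have checked is_bmp_scalar(cp) first.
constexpr char16_t encode_bmp(char32_t cp) {
    return static_cast<char16_t>(cp);
}
```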

Encoding of U+10000..U+10FFFF

A code point in the supplementary planes is encoded as a pair of 16-bit code units (32 bits, 4 bytes), called a surrogate pair.

The specific method is:

Step 1: subtract 0x10000 from the code point. The result is a 20-bit value (0..0xFFFFF).
Step 2: add the high 10 bits of the value from step 1 (range 0..0x3FF) to 0xD800 to get the first code unit, also called the high surrogate or lead surrogate. Its range is 0xD800..0xDBFF.
Step 3: add the low 10 bits of the value from step 1 (range 0..0x3FF) to 0xDC00 to get the second code unit, also called the low surrogate or trail surrogate. Its range is 0xDC00..0xDFFF.
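The steps above can be sketched in portable C++ as follows. This is an illustration only: the function name to_surrogates is hypothetical, C++14 or later is assumed, and no range checking is done.

```cpp
#include <utility>

// Hypothetical helper: split one code point in U+10000..U+10FFFF into
// {lead surrogate, trail surrogate}. Input outside that range is not checked.
constexpr std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    const char32_t v = cp - 0x10000;  // 20-bit value, 0..0xFFFFF
    // High 10 bits added to 0xD800 give the lead surrogate.
    const char16_t lead = static_cast<char16_t>(0xD800 + (v >> 10));
    // Low 10 bits added to 0xDC00 give the trail surrogate.
    const char16_t trail = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    return {lead, trail};
}
```

For instance, to_surrogates(0x1F600) yields the pair 0xD83D, 0xDE00.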

In this way, a character in this range is encoded as a surrogate pair [lead surrogate, trail surrogate]: two 16-bit code units whose value ranges are 0xD800..0xDBFF and 0xDC00..0xDFFF. Code units for BMP characters lie in 0x0000..0xFFFF, excluding the reserved 0xD800..0xDFFF. These three ranges therefore do not overlap, which makes decoding easy to implement.
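Because the three ranges do not overlap, a decoder can tell what role a code unit plays by looking at that unit alone. A minimal sketch, using a hypothetical UnitKind enum and classify function that are not part of the original article:

```cpp
// Hypothetical classification of a single UTF-16 code unit.
enum class UnitKind { Bmp, LeadSurrogate, TrailSurrogate };

// Decide the kind of a code unit from its value alone.
constexpr UnitKind classify(char16_t u) {
    return (u >= 0xD800 && u <= 0xDBFF) ? UnitKind::LeadSurrogate
         : (u >= 0xDC00 && u <= 0xDFFF) ? UnitKind::TrailSurrogate
         : UnitKind::Bmp;
}
```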

Decoding UTF-16
Hi \ Lo	DC00	DC01	...	DFFF
D800	10000	10001	...	103FF
D801	10400	10401	...	107FF
...	...	...	...	...
DBFF	10FC00	10FC01	...	10FFFF
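Each cell of the decoding table follows from reversing the encoding steps: strip the 0xD800 and 0xDC00 offsets, recombine the two 10-bit halves, and add 0x10000 back. The sketch below computes a cell from its row and column; the name table_entry is hypothetical, and both arguments are assumed to be valid surrogates.

```cpp
// Hypothetical helper: code point for the table cell at row `hi`
// (0xD800..0xDBFF) and column `lo` (0xDC00..0xDFFF). Inputs are not checked.
constexpr char32_t table_entry(char16_t hi, char16_t lo) {
    return 0x10000
         + ((static_cast<char32_t>(hi) - 0xD800) << 10)
         + (static_cast<char32_t>(lo) - 0xDC00);
}
```

The first cell, table_entry(0xD800, 0xDC00), is 0x10000, and the last cell, table_entry(0xDBFF, 0xDFFF), is 0x10FFFF.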

The following takes the UTF-16 encoding of U+64321 as an example to show how a character in the supplementary planes is encoded:

V  = 0x64321
Vx = V - 0x10000 = 0x54321 = 0101 0100 0011 0010 0001
Vh = 01 0101 0000 // high 10 bits of Vx
Vl = 11 0010 0001 // low 10 bits of Vx
W1 = 0xD800 // initial value of the first 16 bits of the result
W2 = 0xDC00 // initial value of the last 16 bits of the result
W1 = W1 | Vh = 1101 1000 0000 0000 | 01 0101 0000 = 1101 1001 0101 0000 = 0xD950
W2 = W2 | Vl = 1101 1100 0000 0000 | 11 0010 0001 = 1101 1111 0010 0001 = 0xDF21

So the final UTF-16 encoding of the character U+64321 is:

0xD950 0xDF21

The following are C++ encoding and decoding functions written on this basis:

#include <windows.h> // DWORD, WORD, WCHAR

// Encoding function: converts a Unicode character outside the BMP to UTF-16
// DWORD v: Unicode code point of the character, greater than 0xFFFF
// WORD& w1: high (lead) surrogate after conversion to UTF-16
// WORD& w2: low (trail) surrogate after conversion to UTF-16
void encode(DWORD v, WORD& w1, WORD& w2)
{
    DWORD vx = v - 0x10000;
    WORD vh = (WORD)((vx & 0xFFC00) >> 10); // high 10 bits
    WORD vl = (WORD)(vx & 0x03FF);          // low 10 bits
    w1 = 0xD800;
    w2 = 0xDC00;
    w1 = w1 | vh;
    w2 = w2 | vl;
}

// Decoding function: converts a UTF-16 surrogate pair to a Unicode code point
// DWORD& w: decoded Unicode code point, greater than 0xFFFF
// WORD w1: UTF-16 high (lead) surrogate
// WORD w2: UTF-16 low (trail) surrogate
void decode(DWORD& w, WORD w1, WORD w2)
{
    w1 = w1 & 0x3FF;        // keep the 10 payload bits of each surrogate
    w2 = w2 & 0x3FF;
    w = (DWORD)w1 << 10;    // high 10 bits
    w = w | w2;             // low 10 bits
    w += 0x10000;
}
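The functions above trust their input: encode expects a code point above 0xFFFF, and decode expects a genuine surrogate pair. A possible checked variant is sketched below. It is not the article's code: the names SurrogatePair, encode_checked, and decode_checked are hypothetical, standard fixed-width character types are used instead of the Windows typedefs, and C++17 is assumed for std::optional.

```cpp
#include <optional>

// Hypothetical result type for a checked encoder.
struct SurrogatePair {
    char16_t lead;
    char16_t trail;
};

// Returns no value when cp is outside U+10000..U+10FFFF.
std::optional<SurrogatePair> encode_checked(char32_t cp) {
    if (cp < 0x10000 || cp > 0x10FFFF) return std::nullopt;
    const char32_t v = cp - 0x10000;
    return SurrogatePair{ static_cast<char16_t>(0xD800 | (v >> 10)),
                          static_cast<char16_t>(0xDC00 | (v & 0x3FF)) };
}

// Returns no value when the two units are not a lead/trail surrogate pair.
std::optional<char32_t> decode_checked(char16_t lead, char16_t trail) {
    if (lead < 0xD800 || lead > 0xDBFF) return std::nullopt;
    if (trail < 0xDC00 || trail > 0xDFFF) return std::nullopt;
    return static_cast<char32_t>(
        0x10000 + ((static_cast<char32_t>(lead) & 0x3FF) << 10)
                + (static_cast<char32_t>(trail) & 0x3FF));
}
```

With this sketch, encode_checked(0x64321) gives the pair 0xD950, 0xDF21, and decode_checked(0xD950, 0xDF21) gives 0x64321 back, while a BMP value such as 0x41 is rejected by both functions.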

Example:

(1) Encoding

WCHAR str[3]; // holds one character plus the terminator
memset(str, 0, 3 * sizeof(WCHAR));
DWORD dw = 0x64321; // the Unicode code point of this character is outside the BMP
WORD w1, w2;
encode(dw, w1, w2);
str[0] = w1;
str[1] = w2;
str[2] = 0;

(2) Decoding

DWORD dw = 0;
WORD w1, w2;
w1 = str[0];
w2 = str[1]; // w2 != 0 here; otherwise the character would be in the BMP
decode(dw, w1, w2);
// dw is now the Unicode code point corresponding to the UTF-16 pair
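The example handles a single character. For a whole UTF-16 string, the same decoding can be applied unit by unit. The sketch below is one possible approach, not the article's code: the name utf16_to_code_points is hypothetical, std::u16string is used in place of a WCHAR array, and replacing unpaired surrogates with U+FFFD is an assumption made here for simplicity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-16 string into Unicode code points.
// Assumption: an unpaired surrogate is replaced with U+FFFD.
std::vector<char32_t> utf16_to_code_points(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool is_lead = u >= 0xD800 && u <= 0xDBFF;
        const bool is_trail = u >= 0xDC00 && u <= 0xDFFF;
        const bool next_is_trail = i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (is_lead && next_is_trail) {
            // Combine the 10 payload bits of each surrogate.
            const char32_t hi = u & 0x3FF;
            const char32_t lo = s[i + 1] & 0x3FF;
            out.push_back(0x10000 + ((hi << 10) | lo));
            ++i; // the trail surrogate has been consumed
        } else if (is_lead || is_trail) {
            out.push_back(0xFFFD); // unpaired surrogate
        } else {
            out.push_back(u); // BMP character: code unit equals code point
        }
    }
    return out;
}
```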

References:

http://blog.csdn.net/thl789/article/details/7506133
