Is Windows XP UCS-2 or UTF-16?

Source: Internet
Author: User
Tags: ultraedit

It is generally held that the 16-bit Unicode used in Windows is UCS-2 rather than UTF-16. UCS-2 is a fixed-length encoding in which every code point corresponds one-to-one to a single 16-bit unit, so it can represent only the BMP (Basic Multilingual Plane), U+0000 to U+FFFF. UTF-16 is variable-length, similar in spirit to UTF-8. Because its code unit is 16 bits wide, the BMP part still maps one-to-one to a single unit, but a combination of two 16-bit units (a surrogate pair) can represent all of Unicode, from U+0000 to U+10FFFF. I have seen the two confused in so many places that I was no longer sure of my own statement. Fortunately the UTF-16/UCS-2 article spells out the difference; otherwise I would not know where to find a correct answer. (Even IBM's related web pages list UCS-2 as an alias for UTF-16.)
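
The difference between the two encodings can be made concrete with a minimal sketch of a UTF-16 encoder in portable C. The function name `utf16_encode` is hypothetical and chosen only for illustration; it uses `uint16_t` rather than `wchar_t` so that it behaves the same on every platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written,
 * or 0 if cp is not encodable (a surrogate value or above U+10FFFF).
 * utf16_encode is a hypothetical helper name, not a Windows API. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        /* BMP: one unit, identical to what UCS-2 would store. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Supplementary plane: split the 20-bit offset into two halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low (trail) surrogate */
    return 2;
}
```

Under this sketch a BMP character such as U+4E2D takes one unit, while U+1D300 takes the two units 0xD834 0xDF00. A pure UCS-2 encoder would simply have no way to express the second case.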

The UTF-16/UCS-2 article states the following:

UTF-16 is the native internal representation of text in the Microsoft Windows 2000/XP/2003/Vista/CE; Qualcomm BREW operating systems; the Java and .NET bytecode environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit. [1] [2] [citation needed]

Symbian OS used in Nokia S60 handsets and Sony Ericsson UIQ handsets uses UCS-2.

The Joliet file system, used in CD-ROM media, encodes file names using UCS-2BE (up to 64 Unicode characters per file name).

Older Windows NT systems (prior to Windows 2000) only support UCS-2. [3] In Windows XP, no code point above U+FFFF is supported in any font delivered with Windows for European languages, possibly with Chinese Windows versions. [clarification needed]

So the kernel has clearly been UTF-16 from Windows 2000 onward. This runs against the usual impression, so it is worth testing. See also: Encoding conversion functions in UTF-16 (Python implementation).

I have output three Tai Xuan Jing characters, "𝌀𝌁𝌂", on Windows before, but that output was actually produced in UltraEdit. Although the characters were displayed on Windows, the rendering might have been UltraEdit's own doing. This time we use the Windows API to display them, to prove that Windows itself can recognize and display these three characters. Why does displaying Tai Xuan Jing characters show that the kernel is UTF-16? Because UCS-2 can represent only the BMP, and the Tai Xuan Jing characters (U+1D300 to U+1D302) lie outside that range. If you have an old computer with an earlier version of Windows NT installed, you can try the same example; it should fail to display them.

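    #include <windows.h>
    #include <tchar.h>

    int _tmain(int argc, _TCHAR *argv[])
    {
        /* Three Tai Xuan Jing characters, U+1D300 to U+1D302,
           each stored as a UTF-16 surrogate pair. */
        wchar_t lwc[8];
        lwc[0] = 0xD834;
        lwc[1] = 0xDF00;
        lwc[2] = 0xD834;
        lwc[3] = 0xDF01;
        lwc[4] = 0xD834;
        lwc[5] = 0xDF02;
        lwc[6] = 0;
        lwc[7] = 0;

        /* Show the string as both the message text and the caption. */
        MessageBoxW(NULL, lwc, lwc, MB_OK);
        return 0;
    }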

A dialog box is displayed showing the characters "𝌀𝌁𝌂". Evidently, Windows is capable of correctly recognizing UTF-16 characters.
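
To check which characters the six array values stand for, each surrogate pair can be decoded back into a code point. The following is a minimal sketch in portable C; `utf16_decode_pair` is a hypothetical helper name, and returning U+FFFD for an invalid pair is a choice made here for illustration.

```c
#include <stdint.h>

/* Combine a high and a low surrogate into one code point.
 * Returns 0xFFFD (the replacement character) if the two units
 * do not form a valid surrogate pair.
 * utf16_decode_pair is a hypothetical helper name. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0xFFFD;

    /* The high surrogate carries the top 10 bits, the low one the bottom 10. */
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

With this sketch, the pairs (0xD834, 0xDF00), (0xD834, 0xDF01) and (0xD834, 0xDF02) decode to U+1D300, U+1D301 and U+1D302, all above U+FFFF and therefore outside the BMP.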

Then why is Windows usually said to be UCS-2? Because the programming interface is UCS-2: it simply does not understand characters that lie beyond UCS-2 but are valid UTF-16, such as the Tai Xuan Jing characters above. The string in the example above contains only three characters and is displayed correctly, but look at the following example:

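    #include <stdio.h>
    #include <wchar.h>
    #include <windows.h>
    #include <tchar.h>

    int _tmain(int argc, _TCHAR *argv[])
    {
        /* The same three Tai Xuan Jing characters as surrogate pairs. */
        wchar_t lwc[8];
        lwc[0] = 0xD834;
        lwc[1] = 0xDF00;
        lwc[2] = 0xD834;
        lwc[3] = 0xDF01;
        lwc[4] = 0xD834;
        lwc[5] = 0xDF02;
        lwc[6] = 0;
        lwc[7] = 0;

        /* C library: counts 16-bit units up to the terminator. */
        int i = (int)wcslen(lwc);
        printf("%d\n", i);

        /* Windows API: counts the same way. */
        int j = lstrlenW(lwc);
        printf("%d\n", j);

        return 0;
    }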

Both i and j are 6. Neither the C library function (wcslen) on Windows nor the Windows API (lstrlenW; since it is a Windows API, this is no surprise) recognizes UTF-16 characters correctly. Both count 16-bit units rather than characters, so they cannot even get the number of characters right. The practical programming experience is therefore that, although the kernel is UTF-16, you can still only treat it as UCS-2 -_-!
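
For comparison, here is a minimal sketch of a length function that counts characters (code points) instead of 16-bit units. It is written in portable C with `uint16_t`; the name `utf16_count_code_points` is hypothetical, and treating an unpaired surrogate as one character is an assumption made here for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* Count code points in a zero-terminated UTF-16 string.
 * A valid surrogate pair counts as one; any other unit, including an
 * unpaired surrogate, also counts as one (an assumption of this sketch).
 * utf16_count_code_points is a hypothetical helper name. */
size_t utf16_count_code_points(const uint16_t *s)
{
    size_t n = 0;
    while (*s != 0) {
        /* s[1] is safe to read: s[0] is not the terminator. */
        if (s[0] >= 0xD800 && s[0] <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2; /* skip both halves of the pair */
        else
            s += 1;
        n++;
    }
    return n;
}
```

Applied to the six units used in the test above, this sketch returns 3, whereas a unit-counting function such as wcslen reports 6.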

The above tests may not be completely convincing, so let's look at an MFC example (the MFC version shipped with VS2005).

The usual place: http://groups.google.com/group/jiutianfile/

There is a testunicodemfc.rar project there. Open it and you will see. When the three Tai Xuan Jing characters are entered in the input box, the length calculated by the CEdit control is 6, much the same as in the old multi-byte days. When you delete a character (press Backspace), you delete not a whole Tai Xuan Jing character but half of the last one. The last character disappears from the display, yet half of it still exists: calculate the length again and the output is 5. The only improvement over the old days is that there is no garbled text, because a lone UTF-16 surrogate is outside the range UCS-2 can represent and so has no meaning on its own.
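
What a surrogate-aware Backspace would need to do can be sketched in a few lines of portable C. This is only an illustration of the idea, not how CEdit is implemented; `utf16_backspace` is a hypothetical helper name, and the buffer is assumed to be writable and to hold `len` units followed by a terminator.

```c
#include <stddef.h>
#include <stdint.h>

/* Remove the last character from a UTF-16 buffer of len units
 * and return the new length in units. If the buffer ends with a
 * valid surrogate pair, both units are removed together.
 * utf16_backspace is a hypothetical helper name. */
size_t utf16_backspace(uint16_t *s, size_t len)
{
    size_t cut = 1;

    if (len == 0)
        return 0;

    /* A trailing low surrogate preceded by a high surrogate is one character. */
    if (len >= 2 &&
        s[len - 1] >= 0xDC00 && s[len - 1] <= 0xDFFF &&
        s[len - 2] >= 0xD800 && s[len - 2] <= 0xDBFF)
        cut = 2;

    len -= cut;
    s[len] = 0; /* re-terminate the string */
    return len;
}
```

With the three Tai Xuan Jing characters (6 units) in the buffer, this sketch leaves 4 units after one call, that is, two whole characters, instead of the 5 units and dangling half character observed in the MFC test.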

So far, my conclusion is that Windows already supports UTF-16 at the kernel level, but you still have to program against it as UCS-2 -_-!

For this part, see the UTF-16/UCS-2 article.

For more information about the Unicode encoding ranges, see Mapping of Unicode character planes.
