The essence of char to wchar_t conversion

Source: Internet
Author: User

1. From char to wchar_t


"This problem is more complicated than you think."

From characters to integers

char is an integer type, meaning that in C++ the characters a char can represent are integer values. At this point many articles cite a typical example: the value of 'a' is 0x61. Is that right? If you have carefully read the original books on C and C++ by K&R and Bjarne Stroustrup, you will immediately object that 0x61 is merely the ASCII value of 'a', and that no rule requires char values in C/C++ to correspond to ASCII. C++ does not even fix how many bits a char occupies (CHAR_BIT must be at least 8, but may be larger); it only specifies that sizeof(char) equals 1.
Of course, most of the time a char is 8 bits, and values within the ASCII range do correspond to ASCII.

Localization policy set (locale)

Rules such as "translate 'a' into the integer value 0x61" or "encode characters in the ASCII range as the char values of their ASCII codes" are tied to a particular system and a particular compiler. C++ has a specific term for such a set of rules: the localization policy set (locale). Translation, that is, code conversion (codecvt), is just one member of that set; in C++ each member is defined as a policy (facet).

The C++ compilation strategy

The "localization policy set" is a good idea, but C/C++ does not use it at the character and string level (a C++ locale usually only affects streams). Instead, C/C++ takes a more straightforward approach: hard-coding.
Simply put, the representation of a character (string) in the program file (the executable, not the source file) is identical to its representation in memory while the program runs. Consider two cases:
A. char c = 0x61;
B. char c = 'a';
In case A the compiler can directly treat the initializer as an integer, but in case B it must translate 'a' into an integer. The compiler's strategy here is also simple: it reads the encoded value of the character (string) directly from the source file. For example:
Const char* s = "Chinese abc";
This string is encoded in GB2312 (Windows 936), which is our default Chinese system source file for Windows:
0xd6 0xD0 0xCE 0xc4 0x61 0x62 0x63
In UTF-8, which is the Linux default system source file, the encoding is:
0xE4 0xb8 0xAD 0xe6 0x96 0x87 0x61 0x62 0x63
In general, the compiler will be faithful to the source file encoding for s assignment, exceptions such as VC will be smart to convert most of the other types of encoded strings into GB2312 (except for survivors like UTF-8 without signature).
While the program is executing, s also keeps the code so that no other conversions are made.

Wide character wchar_t
Just as Char does not have a specified size, wchar_t also does not have a standard qualification, and the standard simply requires that a wchar_t can represent a character that any system can recognize, in Win32, wchar_t is 16 bits, and Linux is 32 bits. wchar_t also does not specify the code, because the concept of Unicode is explained later, so here just to mention that in Win32, wchar_t code is UCS-2BE, and Linux is utf-32be (equivalent to ucs-4be), but simply, Within 16 bits, the 3 encoding values for a character are the same. So:
Const wchar_t* WS = L "Chinese abc";
The codes are:
0x4e2d 0x6587 0x0061 0x0062 0x0063//win32,16 bit
0x00004e2d 0x00006587 0x00000061 0x00000062 0x00000063//linux,32 bit
The uppercase L tells the compiler that this is a wide string, so this is where the compiler has to perform a locale-style translation.
For example, the compiler's translation strategy is GB2312 to UCS-2BE in a Windows environment, and UTF-8 to UTF-32BE in a Linux environment.
This requires the source file encoding to match the code conversion policy in the compiler's localization policy set. For example, VC can only read GB2312 source (again with the exception that VC is too clever and automatically converts many other encodings to GB2312 at compile time), and GCC by default reads only UTF-8 source. There is an embarrassment here: MinGW runs under Win32, so the system side only recognizes GB2312, but MinGW is built on GCC, so it only recognizes UTF-8; the result is that wide characters are crippled under MinGW.
A wide character (string) is translated by the compiler, but it is still hard-coded into the program file.

Copyright notice: This is an original article by the blog author and may not be reproduced without the author's permission.
