The essence of char to wchar_t conversion

Last Update: 2015-06-26 Source: Internet

Author: User


1. From char to wchar_t

"This problem is more complicated than you think."

From character to integer

char is an integer type, which means that in C++ every character a char can represent is an integer value. At this point many articles cite a typical example: the value of 'a' is 0x61. Is that right? If you read the original K&R and Bjarne Stroustrup books on C and C++ carefully, you will immediately object that 0x61 is just the ASCII value of 'a', and that no rule says a char value in C/C++ must correspond to ASCII. C++ does not even specify exactly how many bits a char occupies (only that it has at least 8); it specifies only that sizeof(char) equals 1.
Of course, most of the time a char is 8 bits wide, and values within the ASCII range do correspond to ASCII.
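The guarantees in this section can be checked at compile time. The following is a minimal sketch; the helper names value_of_a and a_is_ascii are made up for illustration and are not part of any library.

```cpp
#include <climits>

// The standard guarantees these two facts on every conforming implementation.
static_assert(sizeof(char) == 1, "sizeof(char) is always 1");
static_assert(CHAR_BIT >= 8, "a char has at least 8 bits");

// Hypothetical helper: the integer value the compiler chose for 'a'.
int value_of_a() {
    return static_cast<int>('a');
}

// Hypothetical helper: true only when the execution character set
// maps 'a' to its ASCII value. The standard does not require this.
bool a_is_ascii() {
    return 'a' == 0x61;
}
```

On the common ASCII-based platforms a_is_ascii() returns true, but portable code should not rely on it.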

Localization policy set (locale)

Rules such as "translate 'a' into the integer value 0x61" or "map characters in the ASCII range to their ASCII values as char" are tied to a particular system and a particular compiler. C++ has a specific term for such a set of rules: the locale (a set of localization policies). Translation, that is, code conversion (codecvt), is just one member of that set; in C++ each such policy is called a facet.

The compilation strategy of C++

The locale is a good idea, but C++ does not use it at the level of characters and strings (a C++ locale usually only affects streams). Instead, C/C++ uses a more straightforward strategy: hard-coding.
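To make the remark about streams concrete, here is a minimal sketch of a locale acting as a set of facets that a stream consults. The facet GroupedDigits and the functions format_with_grouping and has_codecvt_facet are hypothetical names introduced for this illustration; std::numpunct, std::codecvt, std::has_facet, and imbue are standard library facilities.

```cpp
#include <cwchar>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: group digits in threes, separated by commas.
struct GroupedDigits : std::numpunct<char> {
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// The stream asks its locale for the numpunct facet when formatting numbers.
std::string format_with_grouping(long value) {
    std::ostringstream out;
    // The locale takes ownership of the facet object.
    out.imbue(std::locale(std::locale::classic(), new GroupedDigits));
    out << value;
    return out.str();
}

// Code conversion (codecvt) is simply another facet in the same set.
bool has_codecvt_facet(const std::locale& loc) {
    return std::has_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);
}
```

The locale changes what the stream prints, while a literal such as "1234567" in the source code is left exactly as the compiler stored it.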
Simply put, a character (or string) is represented in the program file (the executable, not the source file) exactly as it is represented in memory while the program runs. Consider two scenarios:
A. char c = 0x61;
B. char c = 'a';
In scenario A the compiler can treat the initializer of c directly as an integer, but in scenario B it must translate 'a' into an integer. The compiler's strategy here is equally simple: it reads the encoded value of the character (or string) directly from the source file. For example:
Const char* s = "Chinese abc";
This string is encoded in GB2312 (Windows 936), which is our default Chinese system source file for Windows:
0xd6 0xD0 0xCE 0xc4 0x61 0x62 0x63
In UTF-8, which is the Linux default system source file, the encoding is:
0xE4 0xb8 0xAD 0xe6 0x96 0x87 0x61 0x62 0x63
In general, the compiler will be faithful to the source file encoding for s assignment, exceptions such as VC will be smart to convert most of the other types of encoded strings into GB2312 (except for survivors like UTF-8 without signature).
While the program is executing, s also keeps the code so that no other conversions are made.

Wide character wchar_t
Just as char has no fixed size, wchar_t is not pinned down by the standard either. The standard only requires that a wchar_t be able to represent any character the system can recognize. On Win32 wchar_t is 16 bits; on Linux it is 32 bits. The standard does not specify an encoding for wchar_t either. Unicode is explained later, so for now it is enough to mention that on Win32 wchar_t holds UCS-2 code units and on Linux UTF-32 (equivalent to UCS-4) code units, stored in the platform's native byte order. Simply put, for characters that fit within 16 bits, the value is the same in all three encodings. So:
Const wchar_t* WS = L "Chinese abc";
The codes are:
0x4e2d 0x6587 0x0061 0x0062 0x0063//win32,16 bit
0x00004e2d 0x00006587 0x00000061 0x00000062 0x00000063//linux,32 bit
The uppercase L is the one that tells the compiler: This is a wide string. So this is where the compiler needs to translate the locale.
For example, in a Windows environment the compiler's translation strategy is GB2312 to UCS-2; in a Linux environment it is UTF-8 to UTF-32.
This requires the encoding of the source file to match the code-conversion policy in the compiler's locale. For example, by default VC reads only GB2312 source (with the same exception as before: VC is clever enough to convert many other encodings to GB2312 automatically at compile time), and by default GCC reads only UTF-8 source. (There is an awkward case here: MinGW runs on Win32, where the system only recognizes GB2312, yet MinGW is built on GCC and therefore only recognizes UTF-8. The result is that wide characters are effectively broken in MinGW.)
Once the compiler has translated a wide character (or string), the result is likewise hard-coded into the program file.
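The UTF-8 to UTF-32 translation described for the Linux case can be sketched by hand. This is an illustration of the encoding arithmetic only, not how any particular compiler implements it; utf8_to_utf32 is a hypothetical helper that assumes well-formed input and simply stops at the first malformed or truncated sequence (it does not reject overlong forms or surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else break;                         // invalid lead byte
        if (i + extra >= in.size()) break;  // truncated sequence
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);    // append 6 payload bits
        }
        if (!ok) break;
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Feeding it the UTF-8 bytes of the sample string yields the same code points that appear in the wide string on Linux.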

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
