[C/C++] Testing the compatibility of various C/C++ compilers with UTF-8 source files (VC, GCC, BCB)

Author: zyl910

When developing C/C++ programs across platforms, storing source files in UTF-8 is the usual way to keep them from turning into garbled text. However, many compilers handle UTF-8 source files poorly, so I ran some tests and worked out the best way to save such files.

I. Test Program

To test how well each compiler handles UTF-8 source files, I wrote the following test program:

    //#if _MSC_VER >= 1600    // VC2010
    //#pragma execution_character_set("utf-8")
    //#endif

    #include <stdio.h>
    #include <locale.h>
    #include <string.h>
    #include <wchar.h>

    const char* psa = "\u4e00字A";
    const wchar_t* psw = L"\u4e00字W";

    int main(int argc, char* argv[])
    {
        const char* pa;
        const wchar_t* pw;

        setlocale(LC_ALL, "");    // use the system's current code page

        // char: element size, length in chars, the string, then each byte in hex
        printf("len<%d>=%d, str=%s\t// ", (int)sizeof(char), (int)strlen(psa), psa);
        for (pa = psa; *pa != 0; ++pa) printf("%.2X ", (unsigned char)*pa);
        printf("\n");

        // wchar_t: element size, length in wchar_t units, the string, then each unit in hex
        printf("len<%d>=%d, str=%ls\t// ", (int)sizeof(wchar_t), (int)wcslen(psw), psw);
        for (pw = psw; *pw != 0; ++pw) printf("%.4X ", (unsigned int)*pw);
        printf("\n");

        return 0;
    }

 

If the system default encoding is GB2312 (for example, on Simplified Chinese Windows), the program should output:
len<1>=5, str=一字A    // D2 BB D7 D6 41
len<2>=3, str=一字W    // 4E00 5B57 0057

If the system default encoding is UTF-8 (for example, on Linux), the program should output:
len<1>=7, str=一字A    // E4 B8 80 E5 AD 97 41
len<4>=3, str=一字W    // 4E00 5B57 0057

Notes:
1. The angle brackets after "len" hold the width of the character type in bytes. char is generally 1 byte. The width of wchar_t depends on the compiler and operating system: generally 2 bytes on Windows and 4 bytes on Linux.
2. The number to the right of "len<?>=" is the string length counted in elements of that type. For char, one Chinese character takes 2 elements in GB2312 and generally 3 elements in UTF-8. For wchar_t, one Chinese character is generally 1 element.
3. The text to the right of "str=" is the string as displayed.
4. The values to the right of "//" are the hexadecimal values of each element.
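
The gap between the element count and the character count in note 2 can be checked directly. The following is a minimal sketch and not part of the original test program: the helper name utf8_code_points is hypothetical, and the sample string is spelled with hex escapes so that its bytes do not depend on how a compiler reads the source file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 bytes of U+4E00, U+5B57 and 'A', written as hex escapes.
   The literal is split so that 'A' is not parsed as a hex digit. */
static const char sample_utf8[] = "\xE4\xB8\x80\xE5\xAD\x97" "A";

/* Hypothetical helper: counts code points in a valid UTF-8 string by
   skipping continuation bytes (those of the form 10xxxxxx). */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With this sketch, strlen(sample_utf8) is 7, which matches the UTF-8 output above, while utf8_code_points(sample_utf8) is 3.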

II. Test Results

The tests need to cover the following aspects:
1. Multiple compilers on different operating systems.
2. UTF-8 without BOM versus UTF-8 with BOM. There are two ways to store UTF-8: without a signature and with a signature. The only difference between them is whether the file begins with the signature character (BOM).
3. The execution character set. VC2010 added "#pragma execution_character_set("utf-8")", which declares that the execution character set for char is UTF-8.
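
The signature (BOM) that separates the two storage schemes is the three-byte sequence EF BB BF at the very start of the file. As a rough sketch of how a tool could tell the two apart, assuming the first bytes of the file have already been read into a buffer (the helper name has_utf8_bom is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the buffer starts with the UTF-8
   signature EF BB BF, and 0 otherwise. */
static int has_utf8_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}
```

A buffer shorter than three bytes is reported as having no BOM, so an empty file counts as "without BOM".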

Based on these requirements, I built a set of test projects covering both Windows and Linux.
The Windows tests are:
[VC6, noBOM]: VC6.0 SP1, source saved as UTF-8 without BOM.
[VC6, BOM]: VC6.0 SP1, source saved as UTF-8 with BOM.
[VC2003, noBOM]: VC2003 SP1, source saved as UTF-8 without BOM.
[VC2003, BOM]: VC2003 SP1, source saved as UTF-8 with BOM.
[VC2005, noBOM]: VC2005 SP1, source saved as UTF-8 without BOM.
[VC2005, BOM]: VC2005 SP1, source saved as UTF-8 with BOM.
[VC2010, noBOM]: VC2010 SP1, source saved as UTF-8 without BOM.
[VC2010, BOM]: VC2010 SP1, source saved as UTF-8 with BOM.
[VC2010, noBOM, execution_character_set]: VC2010 SP1, source saved as UTF-8 without BOM, with "#pragma execution_character_set("utf-8")" enabled.
[VC2010, BOM, execution_character_set]: VC2010 SP1, source saved as UTF-8 with BOM, with "#pragma execution_character_set("utf-8")" enabled.
[BCB6, noBOM]: Borland C++ Builder 6.0, source saved as UTF-8 without BOM.
[BCB6, BOM]: Borland C++ Builder 6.0, source saved as UTF-8 with BOM.
[GCC(MinGW), noBOM]: GCC 4.6.2 in MinGW, source saved as UTF-8 without BOM.
[GCC(MinGW), BOM]: GCC 4.6.2 in MinGW, source saved as UTF-8 with BOM.

The Linux tests are:
[GCC(Fedora), noBOM, chs]: GCC 4.7.0 as shipped with Fedora 17, source saved as UTF-8 without BOM, system language set to "Simplified Chinese".
[GCC(Fedora), BOM, chs]: GCC 4.7.0 as shipped with Fedora 17, source saved as UTF-8 with BOM, system language set to "Simplified Chinese".
[GCC(Fedora), noBOM, eng]: GCC 4.7.0 as shipped with Fedora 17, source saved as UTF-8 without BOM, system language set to "English".
[GCC(Fedora), BOM, eng]: GCC 4.7.0 as shipped with Fedora 17, source saved as UTF-8 with BOM, system language set to "English".

The test results are summarized below (the text after each semicolon is my comment):

[VC6, noBOM]
len<1>=9, str=u4e00瀛桝    // 75 34 65 30 30 E5 AD 97 41    ; VC6 does not recognize the "\u" escape, so "u4e00" is emitted literally.
len<2>=7, str=u4e00瀛梂    // 0075 0034 0065 0030 0030 701B 6882

[VC6, BOM]
Cannot be compiled, because the compiler treats the BOM as an invalid statement.

[VC2003, noBOM]
len<1>=0, str=    //     ; the compiler fails to recognize the string.
len<2>=3, str=一瀛梂    // 4E00 701B 6882

[VC2003, BOM]
len<1>=0, str=    //
len<2>=3, str=一字W    // 4E00 5B57 0057

[VC2005, noBOM]
len<1>=6, str=一瀛桝    // D2 BB E5 AD 97 41
len<2>=3, str=一瀛梂    // 4E00 701B 6882

[VC2005, BOM]
len<1>=5, str=一字A    // D2 BB D7 D6 41
len<2>=3, str=一字W    // 4E00 5B57 0057

[VC2010, noBOM]
len<1>=6, str=一瀛桝    // D2 BB E5 AD 97 41    ; the UTF-8 encoding of "字A" is "E5 AD 97 41"; the compiler reads these bytes as the GB2312 string "瀛桝" and stores them as a GB2312 string.
len<2>=3, str=一瀛梂    // 4E00 701B 6882    ; the UTF-8 encoding of "字W" is "E5 AD 97 57"; the compiler reads these bytes as the GB2312 string "瀛梂" and stores them as a UTF-16 string.

[VC2010, BOM]
len<1>=5, str=一字A    // D2 BB D7 D6 41    ; thanks to the BOM, the compiler recognizes the string correctly and stores it as a GB2312 string.
len<2>=3, str=一字W    // 4E00 5B57 0057    ; thanks to the BOM, the compiler recognizes the string correctly and stores it as a UTF-16 string.

[VC2010, noBOM, execution_character_set]
len<1>=8, str=(garbled)    // D2 BB E7 80 9B E6 A1 9D    ; "\u4e00" is recognized as "一" and stored in GB2312 as "D2 BB". The UTF-8 encoding of "字A" is "E5 AD 97 41"; the compiler reads these bytes as the GB2312 string "瀛桝" and stores that string in UTF-8 as "E7 80 9B E6 A1 9D". However, the system default encoding is GB2312.
len<2>=3, str=一瀛梂    // 4E00 701B 6882

[VC2010, BOM, execution_character_set]
len<1>=6, str=一瀛桝    // D2 BB E5 AD 97 41    ; "\u4e00" is recognized as "一" and stored in GB2312 as "D2 BB". The UTF-8 encoding of "字A" is "E5 AD 97 41", and the compiler correctly stores it as UTF-8. However, the system default encoding is GB2312.
len<2>=3, str=一字W    // 4E00 5B57 0057

[BCB6, noBOM]
len<1>=6, str=一瀛桝    // D2 BB E5 AD 97 41
len<2>=3, str=一瀛梂    // 4E00 701B 6882

[BCB6, BOM]
Cannot be compiled, because the compiler treats the BOM as an invalid statement.

[GCC(MinGW), noBOM]
len<1>=7, str=涓€瀛桝    // E4 B8 80 E5 AD 97 41    ; stored as UTF-8. However, the system default encoding is GB2312.
len<2>=3, str=一字W    // 4E00 5B57 0057

[GCC(MinGW), BOM]
len<1>=7, str=涓€瀛桝    // E4 B8 80 E5 AD 97 41
len<2>=3, str=一字W    // 4E00 5B57 0057

[GCC(Fedora), noBOM, chs]
len<1>=7, str=一字A    // E4 B8 80 E5 AD 97 41    ; stored as UTF-8. The system default encoding is zh_CN.utf8.
len<4>=3, str=一字W    // 4E00 5B57 0057

[GCC(Fedora), BOM, chs]
len<1>=7, str=一字A    // E4 B8 80 E5 AD 97 41
len<4>=3, str=一字W    // 4E00 5B57 0057

[GCC(Fedora), noBOM, eng]
len<1>=7, str=一字A    // E4 B8 80 E5 AD 97 41    ; stored as UTF-8. The system default encoding is en_US.utf8.
len<4>=3, str=一字W    // 4E00 5B57 0057

[GCC(Fedora), BOM, eng]
len<1>=7, str=一字A    // E4 B8 80 E5 AD 97 41
len<4>=3, str=一字W    // 4E00 5B57 0057

III. Result Analysis

Looking at the test results, a few points stand out immediately:
Neither VC6 nor BCB6 can compile source files saved as UTF-8 with BOM; both treat the signature character (BOM) as an invalid statement.
VC6 does not recognize the "\u" escape.
VC2003 cannot handle UTF-8 encoded char strings.

3.1 Principle Analysis

On Windows, the VC2010 tests are the most representative.

During compilation, two character sets are involved in string processing:
The source character set: the encoding in which the source file is saved.
The execution character set: the encoding in which strings are stored in the executable.

For the program to run without garbled text, two conditions must hold:
1) The compiler identifies the source character set correctly, so that it reads the right string data.
2) The encoding of the runtime environment matches the execution character set. The runtime encoding can be configured with the setlocale function; setlocale(LC_ALL, "") selects the system default encoding, which is generally GB2312 on Simplified Chinese Windows. If the execution character set matches it, text is displayed correctly; otherwise it is garbled.
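
Condition 2 can be checked at startup by looking at the name that setlocale returns. The sketch below is only an illustration under stated assumptions: the helper name locale_name_is_utf8 is hypothetical, and it judges the encoding purely from the text after the last dot in the locale name (for example "zh_CN.utf8"), which is a convention rather than a guarantee.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the codeset part of a locale name
   (the text after the last '.', up to an optional '@' modifier) spells
   UTF-8, ignoring case and the hyphen; returns 0 otherwise. */
static int locale_name_is_utf8(const char *name)
{
    const char *dot;
    char buf[8];
    size_t i, n = 0;

    assert(name != NULL);
    dot = strrchr(name, '.');
    if (dot == NULL)
        return 0;                       /* no codeset in the name */
    for (i = 1; dot[i] != '\0' && dot[i] != '@'; ++i) {
        if (dot[i] == '-')
            continue;                   /* treat "UTF-8" like "utf8" */
        if (n + 1 >= sizeof buf)
            return 0;                   /* too long to be "utf8" */
        buf[n++] = (char)tolower((unsigned char)dot[i]);
    }
    buf[n] = '\0';
    return strcmp(buf, "utf8") == 0;
}
```

A caller would pass in the result of setlocale(LC_ALL, "") after checking that it is not NULL. A name whose codeset part is a code page number such as 936 is reported as not UTF-8.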

VC2010 handles the two character sets as follows:
Source character set: if the file has a signature character (BOM), it is parsed according to that encoding; otherwise the local locale character set is assumed.
Execution character set: for char, if "#pragma execution_character_set" is present, strings are stored in the encoding it names; otherwise the local locale character set is used. For wchar_t, UTF-16 is always used.

When the source is saved as UTF-8 with BOM, VC2010 correctly identifies the source character set as UTF-8. Since there is no "#pragma execution_character_set", the execution character set is the local locale character set:
[VC2010, BOM]
len<1>=5, str=一字A    // D2 BB D7 D6 41    ; thanks to the BOM, the compiler recognizes the string correctly and stores it as a GB2312 string.
len<2>=3, str=一字W    // 4E00 5B57 0057    ; thanks to the BOM, the compiler recognizes the string correctly and stores it as a UTF-16 string.

When the source is saved as UTF-8 without BOM, VC2010 finds no signature character and mistakes the source character set for the local locale character set. Since there is no "#pragma execution_character_set", the execution character set is also the local locale character set:
[VC2010, noBOM]
len<1>=6, str=一瀛桝    // D2 BB E5 AD 97 41    ; the UTF-8 encoding of "字A" is "E5 AD 97 41"; the compiler reads these bytes as the GB2312 string "瀛桝" and stores them as a GB2312 string.
len<2>=3, str=一瀛梂    // 4E00 701B 6882    ; the UTF-8 encoding of "字W" is "E5 AD 97 57"; the compiler reads these bytes as the GB2312 string "瀛梂" and stores them as a UTF-16 string.

Things get more complicated once the execution character set is set to UTF-8 with "#pragma execution_character_set("utf-8")". First, consider the file with BOM, for which VC2010 identifies the source character set correctly:
[VC2010, BOM, execution_character_set]
len<1>=6, str=一瀛桝    // D2 BB E5 AD 97 41    ; "\u4e00" is recognized as "一" and stored in GB2312 as "D2 BB". The UTF-8 encoding of "字A" is "E5 AD 97 41", and the compiler correctly stores it as UTF-8. However, the system default encoding is GB2312.
len<2>=3, str=一字W    // 4E00 5B57 0057

Next, consider the file without BOM. Because VC2010 finds no signature character, it mistakes the source character set for the local locale character set, so the UTF-8 bytes are misread as GB2312. The misread characters are then converted to UTF-8 for storage, as the execution character set demands. At runtime, because the default encoding is GB2312, those UTF-8 bytes are misread as GB2312 a second time:
[VC2010, noBOM, execution_character_set]
len<1>=8, str=(garbled)    // D2 BB E7 80 9B E6 A1 9D    ; "\u4e00" is recognized as "一" and stored in GB2312 as "D2 BB". The UTF-8 encoding of "字A" is "E5 AD 97 41"; the compiler reads these bytes as the GB2312 string "瀛桝" and stores that string in UTF-8 as "E7 80 9B E6 A1 9D". However, the system default encoding is GB2312.
len<2>=3, str=一瀛梂    // 4E00 701B 6882
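
The re-encoding step in that chain can be reproduced by hand. The wchar_t output shows that the misread bytes "E5 AD" became U+701B; encoding U+701B as UTF-8 gives E7 80 9B, which are exactly the bytes that follow "D2 BB" in the char output. The following sketch illustrates this with a hypothetical helper, utf8_encode_3byte, that handles only code points in the three-byte range:

```c
#include <assert.h>

/* Hypothetical helper: encodes a code point in U+0800..U+FFFF as three
   UTF-8 bytes (1110xxxx 10xxxxxx 10xxxxxx). Surrogates are not rejected
   in this sketch. */
static void utf8_encode_3byte(unsigned int cp, unsigned char out[3])
{
    assert(cp >= 0x0800 && cp <= 0xFFFF);
    out[0] = (unsigned char)(0xE0 | (cp >> 12));          /* top 4 bits    */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* middle 6 bits */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));         /* low 6 bits    */
}
```

By the same arithmetic, the trailing bytes E6 A1 9D correspond to U+685D, and U+5B57 (the character actually intended) encodes to E5 AD 97.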

These two examples expose a bug in VC2010: "#pragma execution_character_set" has no effect on "\u" escapes. A "\u" escape is always stored in the local locale character set rather than in the execution character set.

3.2 GCC Analysis

GCC's source character set and execution character set both default to UTF-8, because most Linux systems today use UTF-8. Changing the system language on Linux changes only the region; the character encoding stays UTF-8. That is why the program displays Chinese characters correctly under both "Simplified Chinese" and "English".

The same holds for GCC in MinGW: the source character set and the execution character set both default to UTF-8. However, the default encoding of Simplified Chinese Windows is GB2312, so the UTF-8 string that printf outputs is interpreted as GB2312, which produces garbled text.
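
One possible way to narrow this mismatch, which the tests above did not try, is to ask the Windows console to interpret output bytes as UTF-8 before printing. The sketch below assumes the Win32 function SetConsoleOutputCP with the CP_UTF8 code page; the helper names use_utf8_console and print_utf8 are hypothetical, and whether the characters actually render also depends on the console font.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: on Windows, switches the console output code page
   to UTF-8 and returns 1 on success, 0 on failure. On other systems it
   does nothing and returns 1. */
static int use_utf8_console(void)
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return 1;
#endif
}

/* Hypothetical helper: writes a UTF-8 string to stdout unchanged.
   Returns 1 on success, 0 on a write error. */
static int print_utf8(const char *s)
{
    assert(s != NULL);
    return fputs(s, stdout) >= 0;
}
```

This changes only how the console decodes the bytes; the execution character set chosen by the compiler is unaffected.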

3.3 Best Solution

If the string constants contain no non-ASCII characters, save the source file as UTF-8 without BOM, which early compilers can also handle.
If the string constants do contain non-ASCII characters, save the source file as UTF-8 with BOM, so that most compilers identify the source character set correctly.

Additional notes:
1. The condition is only that string constants contain no non-ASCII characters. If you obtain non-ASCII strings from an external file or by other means, a source file saved as UTF-8 without BOM still works, as long as you choose appropriate string functions.
2. The "#pragma execution_character_set" added in VC2010 explicitly requests UTF-8 strings. Because Windows has no UTF-8 locale, it is of limited practical use.
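
The rule of thumb in this section can be approximated with a simple scan. The sketch below is a conservative illustration, not a parser: the helper name has_non_ascii is hypothetical, and it checks every byte of the buffer rather than only string constants, so a file whose only non-ASCII text is in comments would also be flagged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if any byte in the buffer is outside the
   ASCII range (0x80 or above), and 0 otherwise. */
static int has_non_ascii(const unsigned char *buf, size_t len)
{
    size_t i;
    assert(buf != NULL || len == 0);
    for (i = 0; i < len; ++i) {
        if (buf[i] >= 0x80)
            return 1;
    }
    return 0;
}
```

Under the recommendation above, a file for which this returns 0 can safely be saved as UTF-8 without BOM, while a result of 1 means the file deserves a closer look before the BOM is dropped.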

References:
ISO/IEC 9899:1999 (C99). ISO/IEC, 1999. www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
C99 standard. yourtommy. http://blog.csdn.net/yourtommy/article/details/7495033
QString (2). dbzhang800. http://blog.csdn.net/dbzhang800/article/details/7540905

 

Download source code:
http://files.cnblogs.com/zyl910/testwchar.rar
