Compatibility test of C/C++ compilers (VC, GCC, BCB) with UTF-8 encoded source files

Last Update:2014-08-28 Source: Internet

Author: User

Tags locale


When developing C/C++ programs for several platforms, source files have to be saved as UTF-8 to avoid garbled text. However, many compilers handle UTF-8 source files poorly, so I ran some tests and worked out the best way to save them.

First, the test program

To test a compiler's compatibility with UTF-8 source files, I wrote the following test program. Uncomment the #pragma line to run the execution_character_set tests.

#if _MSC_VER >= 1600    // VC2010
//#pragma execution_character_set("utf-8")
#endif

#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

const char* psa = "\u4e00字A";
const wchar_t* pdw = L"\u4e00字W";

int main(int argc, char* argv[])
{
    const char* pa;
    const wchar_t* pw;

    setlocale(LC_ALL, "");    // Use the system's current code page.

    // char: element width, length, displayed string, then each element in hex.
    printf("len<%d>=%d,str=%s\t// ", (int)sizeof(char), (int)strlen(psa), psa);
    for (pa = psa; *pa != 0; ++pa)    printf("%.2X ", (unsigned char)*pa);
    printf("\n");

    // wchar_t: the same four fields for the wide string.
    printf("len<%d>=%d,str=%ls\t// ", (int)sizeof(wchar_t), (int)wcslen(pdw), pdw);
    for (pw = pdw; *pw != 0; ++pw)    printf("%.4X ", (unsigned int)*pw);
    printf("\n");

    return 0;
}

If the system default encoding is GB2312 (for example, Simplified Chinese Windows), the program should output:
len<1>=5,str=一字A  // D2 BB D7 D6 41
len<2>=3,str=一字W  // 4E00 5B57 0057

If the system default encoding is UTF-8 (for example, a Linux system), the program should output:
len<1>=7,str=一字A  // E4 B8 80 E5 AD 97 41
len<4>=3,str=一字W  // 4E00 5B57 0057

Notes:
1. The number inside the angle brackets after "len" is the width of the character type in bytes. char is normally 1 byte. The width of wchar_t depends on the compiler and the operating system: generally 2 bytes on Windows and 4 bytes on Linux.
2. The number to the right of "len<?>=" is the string length counted in elements of that type. For char, a Chinese character takes 2 elements in GB2312 and usually 3 in UTF-8. For wchar_t, a Chinese character usually takes 1 element.
3. The text to the right of "str=" is the string as displayed.
4. The text to the right of "//" is the hexadecimal value of each element.

Second, the test results

The following aspects need to be tested:
1. Various compilers under different operating systems.
2. UTF-8 without signature versus UTF-8 with signature. A UTF-8 file can be saved in two ways; the only difference is whether it starts with a signature, the byte order mark (BOM).
3. The execution character set. VC2010 adds "#pragma execution_character_set("utf-8")", which makes UTF-8 the execution character set for char strings.

Based on these requirements, I drew up the test items, split into tests on the Windows platform and tests on the Linux platform.
The tests on the Windows platform are:
[VC6, noBOM]: VC6.0 SP1; the source file is UTF-8 without signature.
[VC6, BOM]: VC6.0 SP1; the source file is UTF-8 with signature.
[VC2003, noBOM]: VC2003 SP1; the source file is UTF-8 without signature.
[VC2003, BOM]: VC2003 SP1; the source file is UTF-8 with signature.
[VC2005, noBOM]: VC2005 SP1; the source file is UTF-8 without signature.
[VC2005, BOM]: VC2005 SP1; the source file is UTF-8 with signature.
[VC2010, noBOM]: VC2010 SP1; the source file is UTF-8 without signature.
[VC2010, BOM]: VC2010 SP1; the source file is UTF-8 with signature.
[VC2010, noBOM, execution_character_set]: VC2010 SP1; the source file is UTF-8 without signature and uses "#pragma execution_character_set("utf-8")".
[VC2010, BOM, execution_character_set]: VC2010 SP1; the source file is UTF-8 with signature and uses "#pragma execution_character_set("utf-8")".
[BCB6, noBOM]: Borland C++ Builder 6.0; the source file is UTF-8 without signature.
[BCB6, BOM]: Borland C++ Builder 6.0; the source file is UTF-8 with signature.
[GCC(MinGW), noBOM]: GCC 4.6.2 in MinGW; the source file is UTF-8 without signature.
[GCC(MinGW), BOM]: GCC 4.6.2 in MinGW; the source file is UTF-8 with signature.

The tests on the Linux platform are:
[GCC(Fedora), noBOM, CHS]: GCC 4.7.0 shipped with Fedora 17; the source file is UTF-8 without signature; the system language is set to "Simplified Chinese".
[GCC(Fedora), BOM, CHS]: GCC 4.7.0 shipped with Fedora 17; the source file is UTF-8 with signature; the system language is set to "Simplified Chinese".
[GCC(Fedora), noBOM, ENG]: GCC 4.7.0 shipped with Fedora 17; the source file is UTF-8 without signature; the system language is set to "English".
[GCC(Fedora), BOM, ENG]: GCC 4.7.0 shipped with Fedora 17; the source file is UTF-8 with signature; the system language is set to "English".

The test results are summarized below (the text after a semicolon ";" is my comment):

[VC6, noBOM]
len<1>=9,str=u4e00瀛桝  // 75 34 65 30 30 E5 AD 97 41 ; VC6 does not recognize the "\u" escape and outputs "u4e00" literally.
len<2>=7,str=u4e00瀛梂  // 0075 0034 0065 0030 0030 701B 6882

[VC6, BOM]
Cannot compile! ; The compiler treats the BOM as an invalid statement.

[VC2003, noBOM]
len<1>=0,str=  // ; The compiler does not recognize the string.
len<2>=3,str=一瀛梂  // 4E00 701B 6882

[VC2003, BOM]
len<1>=0,str=  //
len<2>=3,str=一字W  // 4E00 5B57 0057

[VC2005, noBOM]
len<1>=6,str=一瀛桝  // D2 BB E5 AD 97 41
len<2>=3,str=一瀛梂  // 4E00 701B 6882

[VC2005, BOM]
len<1>=5,str=一字A  // D2 BB D7 D6 41
len<2>=3,str=一字W  // 4E00 5B57 0057

[VC2010, noBOM]
len<1>=6,str=一瀛桝  // D2 BB E5 AD 97 41 ; The UTF-8 encoding of "字A" is "E5 AD 97 41"; the compiler reads it as GB2312-encoded "瀛桝" and stores it as a GB2312 string.
len<2>=3,str=一瀛梂  // 4E00 701B 6882 ; The UTF-8 encoding of "字W" is "E5 AD 97 57"; the compiler reads it as GB2312-encoded "瀛梂" and stores it as a UTF-16 string.

[VC2010, BOM]
len<1>=5,str=一字A  // D2 BB D7 D6 41 ; Because of the BOM, the compiler recognizes the string correctly and stores it as a GB2312 string.
len<2>=3,str=一字W  // 4E00 5B57 0057 ; Because of the BOM, the compiler recognizes the string correctly and stores it as a UTF-16 string.

[VC2010, noBOM, execution_character_set]
len<1>=8,str=一鐎涙  // D2 BB E7 80 9B E6 A1 9D ; "\u4e00" is recognized as "一" and stored as GB2312 "D2 BB". The UTF-8 encoding of "字A" is "E5 AD 97 41"; the compiler reads it as GB2312-encoded "瀛桝" and stores it as UTF-8 "E7 80 9B E6 A1 9D". At display time, however, the system default encoding is GB2312.
len<2>=3,str=一瀛梂  // 4E00 701B 6882

[VC2010, BOM, execution_character_set]
len<1>=6,str=一瀛桝  // D2 BB E5 AD 97 41 ; "\u4e00" is recognized as "一" and stored as GB2312 "D2 BB". The UTF-8 encoding of "字A" is "E5 AD 97 41", and the compiler correctly stores it as UTF-8. At display time, however, the system default encoding is GB2312.
len<2>=3,str=一字W  // 4E00 5B57 0057

[BCB6, noBOM]
len<1>=6,str=一瀛桝  // D2 BB E5 AD 97 41
len<2>=3,str=一瀛梂  // 4E00 701B 6882

[BCB6, BOM]
Cannot compile! ; The compiler treats the BOM as an invalid statement.

[GCC(MinGW), noBOM]
len<1>=7,str=涓€瀛桝  // E4 B8 80 E5 AD 97 41 ; Stored as UTF-8. At display time, however, the system default encoding is GB2312.
len<2>=3,str=一字W  // 4E00 5B57 0057

[GCC(MinGW), BOM]
len<1>=7,str=涓€瀛桝  // E4 B8 80 E5 AD 97 41
len<2>=3,str=一字W  // 4E00 5B57 0057

[GCC(Fedora), noBOM, CHS]
len<1>=7,str=一字A  // E4 B8 80 E5 AD 97 41 ; Stored as UTF-8. At display time the system default is zh_CN.UTF8, so the output is normal.
len<4>=3,str=一字W  // 4E00 5B57 0057

[GCC(Fedora), BOM, CHS]
len<1>=7,str=一字A  // E4 B8 80 E5 AD 97 41
len<4>=3,str=一字W  // 4E00 5B57 0057

[GCC(Fedora), noBOM, ENG]
len<1>=7,str=一字A  // E4 B8 80 E5 AD 97 41 ; Stored as UTF-8. At display time the system default is en_US.UTF8, so the output is normal.
len<4>=3,str=一字W  // 4E00 5B57 0057

[GCC(Fedora), BOM, ENG]
len<1>=7,str=一字A  // E4 B8 80 E5 AD 97 41
len<4>=3,str=一字W  // 4E00 5B57 0057

Third, the result analysis

Looking at the test results, a few points stand out first:
Neither VC6 nor BCB6 can compile source files saved as UTF-8 with signature; both treat the signature (BOM) as an invalid statement.
VC6 does not recognize the "\u" escape.
VC2003 does not recognize char strings in UTF-8 source files: the result is an empty string, with or without a BOM.

3.1 Principle Analysis

Among the Windows tests, VC2010 is the most typical, so it is used as the example here.

During compilation, string processing involves two character sets:
Source character set: the encoding in which the source file is saved.
Execution character set: the encoding in which strings are stored inside the executable program.

For the program to display text without garbling, two conditions must hold:
1) The compiler identifies the source character set correctly and therefore obtains the correct string data.
2) The encoding of the runtime environment matches the execution character set. The runtime encoding can be configured with the setlocale function; setlocale(LC_ALL, "") selects the system default encoding. On Simplified Chinese Windows this is generally GB2312: if the execution character set is the same, the text displays normally, otherwise it is garbled.

VC2010 handles the two character sets as follows:
Source character set: if the file has a signature (BOM), it is parsed with the encoding the signature indicates; otherwise the local locale character set is used.
Execution character set: for char strings, if "#pragma execution_character_set" is present, strings are stored in the encoding it names; otherwise the local locale character set is used. For wchar_t strings, UTF-16 is always used.

When the source file is UTF-8 with signature, VC2010 correctly identifies the source character set as UTF-8. Because there is no "#pragma execution_character_set", the execution character set is the local locale character set:
[VC2010, BOM]
len<1>=5,str=一字A  // D2 BB D7 D6 41 ; Because of the BOM, the compiler recognizes the string correctly and stores it as a GB2312 string.
len<2>=3,str=一字W  // 4E00 5B57 0057 ; Because of the BOM, the compiler recognizes the string correctly and stores it as a UTF-16 string.

When the source file is UTF-8 without signature, VC2010 finds no signature and mistakes the source character set for the local locale character set. Because there is no "#pragma execution_character_set", the execution character set is also the local locale character set:
[VC2010, noBOM]
len<1>=6,str=一瀛桝  // D2 BB E5 AD 97 41 ; The UTF-8 encoding of "字A" is "E5 AD 97 41"; the compiler reads it as GB2312-encoded "瀛桝" and stores it as a GB2312 string.
len<2>=3,str=一瀛梂  // 4E00 701B 6882 ; The UTF-8 encoding of "字W" is "E5 AD 97 57"; the compiler reads it as GB2312-encoded "瀛梂" and stores it as a UTF-16 string.

When the execution character set is set to UTF-8 with "#pragma execution_character_set("utf-8")", the situation becomes more complicated. First, take the file with signature, for which VC2010 identifies the source character set correctly:
[VC2010, BOM, execution_character_set]
len<1>=6,str=一瀛桝  // D2 BB E5 AD 97 41 ; "\u4e00" is recognized as "一" and stored as GB2312 "D2 BB". The UTF-8 encoding of "字A" is "E5 AD 97 41", and the compiler correctly stores it as UTF-8. At display time, however, the system default encoding is GB2312.
len<2>=3,str=一字W  // 4E00 5B57 0057

Now the case without signature. Because VC2010 finds no signature, it mistakes the source character set for the local locale character set, that is, it reads the UTF-8 bytes as GB2312. The misread string is then converted to the execution character set, UTF-8, and stored. Finally, at run time the default encoding is GB2312, so the stored UTF-8 bytes are once again misread as GB2312:
[VC2010, noBOM, execution_character_set]
len<1>=8,str=一鐎涙  // D2 BB E7 80 9B E6 A1 9D ; "\u4e00" is recognized as "一" and stored as GB2312 "D2 BB". The UTF-8 encoding of "字A" is "E5 AD 97 41"; the compiler reads it as GB2312-encoded "瀛桝" and stores it as UTF-8 "E7 80 9B E6 A1 9D". At display time, however, the system default encoding is GB2312.
len<2>=3,str=一瀛梂  // 4E00 701B 6882

These two examples reveal a bug in VC2010: "#pragma execution_character_set" has no effect on the "\u" escape, which always uses the local locale character set instead of the execution character set.

3.2 GCC Analysis

GCC defaults to UTF-8 for both the source character set and the execution character set, because most Linux systems now use UTF-8. Changing the system language on Linux changes only the region; the character encoding stays UTF-8. That is why the program displays the Chinese characters correctly under both "Simplified Chinese" and "English".

The same holds for GCC in MinGW: the source character set and the execution character set both default to UTF-8. But the default encoding of Simplified Chinese Windows is GB2312, so the UTF-8 string that printf outputs is misread as GB2312 and appears garbled.

3.3 Best Solution

If the string constants contain no non-ASCII characters, save the source file as UTF-8 without signature, so that older compilers are supported as well.
If the string constants do contain non-ASCII characters, save the source file as UTF-8 with signature, which lets most compilers handle the source character set correctly.

Supplement:
1. Note that the condition is only "no non-ASCII characters in string constants". If non-ASCII strings are obtained at run time from an external file or some other source, a source file saved as UTF-8 without signature works fine as long as suitable string functions are used.
2. The "#pragma execution_character_set" added in VC2010 explicitly requests UTF-8 strings. Because Windows has no UTF-8 locale, it is of limited practical use.

Reference documents:
"ISO/IEC 9899:1999 (C99)". ISO/IEC, 1999. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
"C99 standard". Yourtommy. http://blog.csdn.net/yourtommy/article/details/7495033
"Qstring (2)". dbzhang800. http://blog.csdn.net/dbzhang800/article/details/7540905

Source code download:
http://files.cnblogs.com/zyl910/testwchar.rar
