[Repost] Character encoding issues in C++ programs

Source: Internet
Author: User
Tags: gettext, locale

Referenced from: http://blog.chinaunix.net/uid-26790551-id-3190813.html

Our traditional programs mostly ran only on Windows or only on Linux. The Windows programs used the
Simplified Chinese GB18030 encoding and the Linux programs used only English, so for many years these
programs ran without any encoding problems.

In recent years, as programs have been split into components, some code, especially common components,
has had to support both Windows and Linux. Encoding problems of varying severity then appeared, such as
compiler errors at compile time or garbled text at run time. These problems all come from the program
choosing the wrong character encoding.

This article briefly analyzes some character encoding problems in C++ and proposes solutions. Limited by
experience and time, parts of it may be incomplete; it is for reference only.

1. Does the encoding of C++ source files require special consideration?

1.1. Several related concepts

The first step is to distinguish a few concepts:

Encoding of C++ source files

The character encoding (GB18030, UTF-8, etc.) used by the C++ source files (.cpp/.h) themselves.

Internal encoding of C++ programs

After compilation, the string constants in C++ source become sequences of bytes stored in the executable
file. The internal encoding is the encoding in which those strings are stored in the executable. The
string constants meant here are narrow-character (char) strings, not wide-character (wchar_t) strings.
Wide characters are usually stored as Unicode (UTF-16 with VC, UTF-32 with GCC, in the byte order of
the target platform).

Runtime environment encoding

The encoding used by the operating system or terminal when the program is executed. Characters output by
the program must end up in the runtime environment encoding to be displayed correctly; otherwise the
output is garbled.

1.2. Encodings commonly used in each environment

Encoding of C++ source files

In a Simplified Chinese Windows environment, editors (including Visual Studio) typically create new
files in GB18030 by default, so unless another encoding is chosen explicitly, C++ source files on
Windows are usually GB18030.

On Linux, UTF-8 is the most commonly used encoding and also the recommended one.

Internal encoding of C++ programs

The Simplified Chinese version of VC that we commonly use has GB18030 as its internal encoding, while
gcc/g++ default to UTF-8, which can be changed with the -fexec-charset option.

Note: You can determine the internal encoding used by a program by printing each byte of a string in hexadecimal.
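
The check described in this note can be sketched as follows. This is only a minimal illustration; the helper name hex_bytes is hypothetical and not part of the original article.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format every byte of a narrow string as two hex
// digits, separated by spaces, so the stored encoding can be inspected.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned char>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Passing a literal that contains the character U+4E2D, typed directly in the source file, should give "e4 b8 ad" when the internal encoding is UTF-8 and "d6 d0" when it is GB18030, provided the compiler read the source file with the right encoding.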

Runtime environment encoding

The runtime environment encoding of the Simplified Chinese version of Windows that we commonly use is
GB18030, while the most common environment encoding on Linux is UTF-8.

1.3. Relationship between these encodings

The compiler compiles the source program into an object file, and the object file outputs information
to the terminal when it runs, so the encodings are related as follows:

+----------------+
| source program |---------- source file encoding
+-------+--------+
        | compiled by the compiler
+-------+--------+
|  object file   |---------- internal encoding
+-------+--------+
        | outputs information when run
+-------+--------+
|     output     |---------- runtime environment encoding
+----------------+

    • The compiler must correctly identify the source file encoding when it compiles the source file
      into the object file: it converts each string in the source file from the source file encoding
      to the internal encoding and stores the result in the object file.

      Note: When the source file encoding and the internal encoding are both set to UTF-8 (the GCC
      defaults), GCC does not seem to convert the characters in the source file; it stores each string
      in the object file unchanged. In that case a GB18030-encoded string in the source program is
      still GB18030-encoded when it is output. With other settings, if the actual encoding of the
      source file differs from the one given in the compile options, the compiler reports an error
      that it cannot convert from XXX to UTF-8. It is therefore unclear why a GB18030-encoded source
      file compiles when both encodings are UTF-8; a plausible explanation is that GCC skips the
      conversion step, and with it the validation, when the two encodings are identical.
    • The C++ standard library must correctly identify the terminal's runtime environment encoding and
      convert the program's output to that encoding so that it is displayed correctly.

If any link in this process goes wrong, the program's output will be abnormal: garbled text, or
something more serious.

2. What encoding should the source file use?

2.1. Do compilers support the same source file encodings?

According to the information at http://stackoverflow.com/questions/688760/how-to-create-a-utf-8-string-literal-in-visual-c-2008 ,
different versions of GCC and VC handle the encoding of C++ source files differently:

    • GCC (v4.3.2 20081105):

      Supports UTF-8 encoded source files, but they must not have a BOM.

      According to http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415 , GCC seems to support UTF-8
      files with a BOM starting from version 4.4.0.

    • VC2003:

      Supports UTF-8 encoded source files, with or without a BOM.

    • VC2005+:

      A source file that uses UTF-8 encoding must have a BOM.

Note: GCC provides the -finput-charset option to specify the character encoding of source files.
However, because the standard header files are ASCII-encoded, the source encoding must be
ASCII-compatible if the source includes standard headers. No similar option could be found for VC.

2.2. What encoding should the source file use?

Many articles recommend using only ASCII characters in C/C++ code and writing any non-ASCII characters
that are needed as \xHH or \uXXXX escapes. For comments, UTF-8 is recommended. Alternatively, gettext
can move non-ASCII strings into separate language files so that only ASCII characters remain in the
source code.
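
As a rough illustration of the escape-based style, the sketch below spells the two characters U+4E2D U+6587 ("Chinese") using only ASCII in the source. The function names are hypothetical, and the narrow version assumes UTF-8 is the byte sequence wanted.

```cpp
#include <string>

// Narrow string spelled with \xHH escapes: the bytes are exactly the UTF-8
// encoding of U+4E2D U+6587, whatever the source file encoding is.
std::string chinese_utf8() {
    return "\xE4\xB8\xAD\xE6\x96\x87";
}

// Wide string spelled with \uXXXX universal character names: the compiler
// stores each code point as a wchar_t.
std::wstring chinese_wide() {
    return L"\u4e2d\u6587";
}
```

The source file itself stays pure ASCII, so its encoding no longer matters to the compiler, at the cost of literals that are hard to read and check.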

In practice, escapes such as \xHH and \uXXXX are unintuitive and error prone, and the errors are hard
to find. Moreover, not every program needs to support multiple languages, so you may not want to
introduce gettext or a similar solution. In that case people are used to writing non-ASCII characters
such as Chinese directly in the source files, which requires choosing a file encoding that at least
both GCC and VC accept.

In principle Unicode is the best choice for multi-language problems, and UTF-8, being ASCII-compatible,
is the most widely used Unicode encoding. But as the information above shows, with UTF-8, GCC (at least
older versions) does not allow a BOM while VC2005 and later require one, so the same file cannot be
compiled by both GCC and VC, and UTF-8 does not look like a good choice. With a newer GCC (4.4.0 or
later?), however, a UTF-8 encoded file with a BOM should work for both.
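
Since the BOM decides which compilers accept a UTF-8 file, it can help to check whether a file has one. The sketch below works on the raw bytes of a file that has already been read into a string; the function names are hypothetical.

```cpp
#include <string>

// Return true if the byte sequence starts with the UTF-8 byte order mark
// EF BB BF.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Return a copy of the bytes with a leading BOM removed, if there is one.
std::string strip_utf8_bom(const std::string& bytes) {
    return has_utf8_bom(bytes) ? bytes.substr(3) : bytes;
}
```

A build script could use such a check to add or strip the BOM depending on the compiler, although keeping two variants of every file has its own maintenance cost.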

Given that we currently work mostly on Simplified Chinese Windows, GB18030 seems a more realistic
choice for source files. VC compiles such files directly, and GCC supports them with the compile
option -finput-charset=gb18030. According to the GB18030 entry on Wikipedia, GB18030 is backwards
compatible with ASCII and can represent the whole range of Unicode code points, so it has enough
expressive power for every Unicode character. The only disadvantage of GB18030 is that a
non-Simplified Chinese version of VC offers no way to specify the source file encoding, so source
files in this encoding may not be recognized correctly there.

3. What internal encoding should the program use?

As mentioned earlier, C++ has narrow characters (char) and wide characters (wchar_t), each with its
own classes and functions (string/cout/strlen versus wstring/wcout/wcslen, etc.). The default encoding
of the former differs between compilers (GB18030 for Simplified Chinese VC, UTF-8 for GCC). The latter
generally use Unicode: UTF-16 in VC and, by default, UTF-32 in GCC.
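
The practical difference between the two families shows up as soon as lengths are measured. A small sketch, with hypothetical wrapper names around the standard strlen and wcslen:

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// strlen counts char units, that is bytes, not characters.
std::size_t narrow_units(const char* s) { return std::strlen(s); }

// wcslen counts wchar_t units.
std::size_t wide_units(const wchar_t* s) { return std::wcslen(s); }
```

For the two characters U+4E2D U+6587, narrow_units reports 6 for the UTF-8 bytes "\xE4\xB8\xAD\xE6\x96\x87" and 4 for the GB18030 bytes "\xD6\xD0\xCE\xC4", while wide_units reports 2 for L"\u4e2d\u6587". Characters outside the Basic Multilingual Plane take two wchar_t units with VC (UTF-16) and one with GCC (UTF-32), so even the wide count is not always a character count.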

C++ outputs narrow characters exactly as they are stored in the internal encoding, with no encoding
conversion. Programs that use narrow characters therefore need the internal encoding to match the
runtime environment encoding, or the output is garbled. Since the internal encoding of Simplified
Chinese VC is GB18030, a VC program that uses narrow characters can only run in a GB18030 environment.
Similarly, because GCC uses UTF-8 as the internal encoding by default, a GCC program that uses narrow
characters can only run in a UTF-8 terminal environment. (This applies to programs that write
non-ASCII characters such as Chinese directly in the source code. With gettext and similar tools, as
mentioned earlier, programs that use narrow characters can also produce correct output in runtime
environments with different encodings.)

When C++ outputs wide characters, it converts them automatically to the runtime environment encoding,
so as long as the runtime environment encoding is set correctly, the same program can display Chinese
correctly in runtime environments with different encodings. This is similar to Java and .NET, whose
string types use Unicode and convert to and from the encoding of the current runtime environment on
input and output.

In general, if you need to support multiple languages, there are two good practices:

  1. Use narrow characters, but only ASCII characters in the source program. Non-ASCII text is placed
    in separate files with gettext or a similar tool, which also takes care of the encoding conversion.

    • Output is correct in runtime environments with any encoding.

    • Non-ASCII characters cannot appear directly in the program. They cannot be written as \uXXXX
      either, because the compiler converts those escapes to non-ASCII characters as well and stores
      them in the object file.

    • Comments can use any ASCII-compatible encoding without affecting the compiler.

    • There is more existing code to reuse.

  2. Use wide characters.

    • Output is correct in runtime environments with any encoding.

    • Non-ASCII characters can be used in the program.

    • The source file encoding settings described earlier must match, so that the compiler correctly
      identifies the non-ASCII characters in the source program.

    • Because few programs have used wide characters in the past, there is less existing code to reuse.

Note: If the program needs string constants in a fixed character encoding, for example constants that
must be GB18030-encoded, store them as \xHH escapes so that the content is not converted to the
internal encoding or to the runtime environment encoding.
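
A constant pinned to GB18030 in this way might look like the sketch below. The function names are hypothetical; the bytes D6 D0 CE C4 are the GB18030 encoding of U+4E2D U+6587.

```cpp
#include <string>

// The characters U+4E2D U+6587 pinned to their GB18030 bytes. The compiler
// copies \xHH escapes into the object file without any conversion.
std::string chinese_gb18030() {
    return "\xD6\xD0\xCE\xC4";
}

// A hex escape consumes every hex digit that follows it, so "\xC4ABC" would
// be read as one oversized escape. Ending the literal and starting a new
// adjacent one keeps the text "ABC" separate from the escape.
std::string labelled_gb18030() {
    return "\xD6\xD0\xCE\xC4" "ABC";
}
```

Such a constant is written out unchanged by narrow output, so it only displays correctly where the receiver expects GB18030.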

4. What character encoding should the runtime environment use?

As described above, programs that use narrow characters and programs that use wide characters place
different requirements on the character encoding of the runtime environment.

    • With wide characters, as long as the program correctly sets the character encoding of the current
      environment (typically with std::locale::global(std::locale(""))), the C++ standard library
      performs the character encoding conversion, so the program adapts to runtime environments with
      any encoding.

    • With narrow characters and no non-ASCII characters in the program, there are no special
      requirements on the runtime environment; the program adapts to runtime environments with any
      encoding.

    • With narrow characters and non-ASCII characters such as Chinese written directly in the program,
      the C++ standard library outputs the strings stored in the object file (in the internal
      encoding) directly, without any character encoding conversion, so the runtime environment
      encoding must match the internal encoding. That is, programs compiled with Simplified Chinese VC
      can only run in a GB18030 environment, and programs compiled with GCC can only run in a UTF-8
      environment (which can be changed with the -fexec-charset option at compile time).
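
For the wide-character case, the locale setup could be sketched roughly as follows. The function name use_environment_locale is hypothetical, and the fallback to the classic "C" locale is an assumption of this sketch rather than something the article prescribes.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Adopt the locale named by the environment (LANG, LC_ALL and so on under
// Linux, the system settings under Windows). Returns false and keeps the
// classic "C" locale if the environment names a locale that is unavailable.
bool use_environment_locale() {
    try {
        std::locale::global(std::locale(""));
    } catch (const std::runtime_error&) {
        std::locale::global(std::locale::classic());
        return false;
    }
    // Streams that already exist keep their old locale until imbued.
    std::wcout.imbue(std::locale());
    return true;
}
```

A program would call use_environment_locale() once at the start of main() and then write wide strings such as L"\u4e2d\u6587" to std::wcout; the stream converts them to the encoding of the selected locale. If that encoding cannot represent a character, the stream enters a failed state instead of printing it.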

5. Choosing the encoding of C++ source files

5.1. Several possible practices

Based on the discussion above, there appear to be the following ways to stay compatible with both
Windows/Linux and VC/GCC.

  1. Use narrow characters and only ASCII characters in the source program; non-ASCII text such as
    Chinese is placed in separate language packs with gettext or a similar tool.

    • This is the approach that many people recommend.

    • Compatible with all versions of VC and GCC.

    • Because non-ASCII characters do not appear in the source program, the encoding of the source
      files does not need to be considered.

    • Compatible with runtime environments with any encoding.

  2. Use narrow characters and allow non-ASCII characters in the source program.

    • Requires the runtime environment encoding to match the internal encoding, that is, only
      GB18030-encoded Windows and UTF-8-encoded Linux are supported.

    • Compiler compatibility depends on the encoding of the source program:

      1. Narrow characters, source program in UTF-8 with a BOM.

        • Compatible with all versions of VC.

        • Compatible with GCC 4.4.0 or later.

      2. Narrow characters, source program in GB18030.

        • Compatible with Simplified Chinese versions of VC.

        • Compatible with all versions of GCC, but the -finput-charset=gb18030 option must be added
          at compile time.

  3. Use wide characters and allow non-ASCII characters in the source program.

    • Compatible with runtime environments with any encoding.

    • Compiler compatibility depends on the encoding of the source program:

      1. Wide characters, source program in UTF-8 with a BOM.

        • Compatible with all versions of VC.

        • Compatible with GCC 4.4.0 or later.

      2. Wide characters, source program in GB18030.

        • Compatible with Simplified Chinese versions of VC.

        • Compatible with all versions of GCC, but the -finput-charset=gb18030 option must be added
          at compile time.

5.2. Recommended practices

Given our current situation, programs that need to support multiple languages should use narrow
characters and only ASCII characters in the source program.

For programs that do not need to support multiple languages, consider using narrow characters with
GB18030-encoded source files so that existing code can be reused. Such programs can then run only in
GB18030-encoded Windows environments and UTF-8-encoded Linux environments.

6. Other issues

6.1. User input, output, and persistence

User input and output, and data read from or written to files, networks, and other facilities, all
look like byte streams to the lower layers of the program. The program therefore has to interpret
these byte streams as meaningful information on input, and convert the information in the program
into the correct byte stream on output.

  • If the program does not need to process the data itself and only moves it from one place to
    another (for example, saving user input to a file, or reading from one stream and writing to
    another), and the input and output use the same character encoding, the program needs no encoding
    conversion. It simply writes the data it read to the output unchanged, and the character encoding
    of the data is independent of the program's internal encoding.

    For example, a web application only needs to ensure that its pages, database, and data files all
    use UTF-8. User input can then be written directly to the database and data files, and data read
    from them can be sent directly to the user, with no encoding conversion.

  • If the program needs to process the data to some extent (for example, count characters, compare
    strings, or append or remove content), the data must first be converted to a definite character
    encoding, generally the program's internal encoding, then processed, and finally converted to the
    desired character encoding for output.

    • For wide-character programs that only need to process data in the character encoding of the
      current runtime environment, the character encoding of an I/O stream can be specified with
      std::ios::imbue(), and the C++ standard library then converts automatically between that
      encoding and the internal encoding on input and output. Without streams, the standard
      wcstombs() and mbstowcs() functions convert between wide characters and the current encoding
      (set with std::locale::global() or setlocale()).

    • For narrow-character programs, if the character encoding of the data matches the internal
      encoding, the data can be processed directly without conversion.

    • In other cases, iconv or a similar character encoding conversion library is needed to convert
      between different character encodings.
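
The stream-free conversion mentioned in the list above can be sketched with the standard wcstombs() and mbstowcs() functions. The wrapper names to_multibyte and to_wide are hypothetical, and the sketch assumes the strings contain no embedded null characters.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Convert a wide string to the multibyte encoding of the current C locale.
// Returns false if some character cannot be represented in that encoding.
bool to_multibyte(const std::wstring& wide, std::string& out) {
    // MB_CUR_MAX is the largest number of bytes one character can need.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return false;
    out.assign(buf.data(), n);
    return true;
}

// Convert a multibyte string in the current C locale's encoding to wide
// characters. Returns false if the bytes are not valid in that encoding.
bool to_wide(const std::string& bytes, std::wstring& out) {
    // A multibyte string never has more characters than bytes.
    std::vector<wchar_t> buf(bytes.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), bytes.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return false;
    out.assign(buf.data(), n);
    return true;
}
```

Both functions depend on the locale that is active when they are called, so the same call can succeed under a UTF-8 locale and fail under the default "C" locale for non-ASCII text. Converting between two specific encodings that are both different from the current locale still requires iconv or a similar library.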

6.2. Alternatives to gettext and iconv

Because gettext and iconv belong to the GNU Project, licensing considerations make them unsuitable for
some programs, especially commercial ones. Boost 1.48.0 was the first release to include the
Boost.Locale library. It provides functionality similar to gettext and iconv and extends it with case
conversion, collation, time handling, text segmentation, formatted input/output of numbers, message
formatting, multi-language support, character encoding conversion, and more, and is worth further
study and use.

