On the character encoding problem of C++ programs


Last Update:2018-08-06 Source: Internet

Author: User

Tags gettext locale


The compiler compiles the source program into an object file, and the object file is run to output information to the terminal, so several encodings are involved and related to one another:

 +-------------+
 | source code |---------- source file encoding
 +------+------+
        | compiled by the compiler
 +------+------+
 | object file |---------- program's internal encoding
 +------+------+
        | run, producing output
 +------+------+
 |   output    |---------- runtime environment encoding
 +-------------+

The compiler needs to correctly identify the source file encoding and, when compiling the source file into the object file, convert the strings in the source file from the source file encoding into the program's internal encoding before storing them in the object file.

Note

When both the source file encoding and the program's internal encoding are UTF-8 (the GCC defaults), GCC appears not to convert the character encoding of the source file at all, and simply stores each string as it is in the object file. In that case, a GB18030-encoded string in the source program is still GB18030-encoded when it is output. In other cases, however, if the actual encoding of the source file does not match the compile option, compilation fails with a "cannot convert from XXX to UTF-8" error, so it is not clear why a GB18030-encoded source file compiles when both encodings are set to UTF-8.
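
One way to check what the compiler actually stored for a string literal is to dump its bytes in hex. The sketch below is only illustrative: the helper name to_hex is made up for this example and is not part of any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: renders each byte of a narrow string as two
// lowercase hex digits, separated by single spaces.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

For instance, to_hex("\xE4\xB8\xAD") gives "e4 b8 ad", the UTF-8 bytes of U+4E2D. Passing a literal typed directly in Chinese instead shows which encoding ended up in the object file.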

The C++ standard library needs to correctly identify the encoding of the terminal's runtime environment and convert the program output into that encoding so that it displays correctly.

If any link in this process goes wrong, the program's output will be abnormal, resulting in garbled text or other more serious consequences.

2. What encoding should be used for the source file.

2.1. Does the compiler support different source file encodings?

According to the information at http://stackoverflow.com/questions/688760/how-to-create-a-utf-8-string-literal-in-visual-c-2008, different versions of GCC and VC handle C++ source file encodings differently:

GCC (v4.3.2 20081105):

Supports UTF-8 encoded source files, but they must not have a BOM.

According to http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415, GCC appears to support UTF-8 files with a BOM starting with version 4.4.0.

VC2003:

Supports UTF-8 encoded source files; the BOM is optional.

VC2005+:

If the source file uses UTF-8 encoding, it must have a BOM.

Note

GCC provides the -finput-charset option to specify the character encoding of the source file, but because the standard header files are ASCII-encoded, a source file that includes standard headers must use an encoding that is compatible with ASCII. The author did not find a similar option for VC.

2.2. What encoding should be used for the source file.

Many articles recommend using only ASCII characters in C++ code and writing any non-ASCII characters as \xHH or \uXXXX escapes. UTF-8 encoding is recommended for comments. You can also use gettext to move the non-ASCII strings into separate language files, keeping only ASCII characters in the source code.
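
As a rough illustration of the escape approach, the sketch below writes U+4E2D using ASCII-only source text. The constant names are invented for this example. It also shows one reason \xHH escapes are easy to get wrong: a \x escape absorbs every hex digit that follows it.

```cpp
// U+4E2D written as its three UTF-8 bytes. The escapes fix the bytes, so
// they are the same whatever encoding the source file is saved in.
const char kZhongUtf8[] = "\xE4\xB8\xAD";

// Pitfall: writing "\xADbc" would not mean byte 0xAD followed by "bc",
// because 'b' and 'c' are hex digits and become part of the escape.
// Splitting the literal ends the escape; the compiler then joins the
// adjacent pieces into one string.
const char kZhongThenText[] = "\xE4\xB8\xAD" "bc";

// The same character as a universal character name in a wide literal.
const wchar_t kZhongWide[] = L"\u4e2d";
```

Here kZhongUtf8 occupies four bytes including the terminating NUL, and kZhongThenText occupies six.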

In practice, \xHH and \uXXXX escapes are unintuitive and error prone, and mistakes in them are hard to spot. Not every program needs to support multiple languages either, so you may not want to introduce gettext or a similar solution. In that case, we tend to write Chinese and other non-ASCII characters directly in the source files, which requires choosing a file encoding that at least both GCC and VC accept.

In principle, Unicode is the best choice for solving multi-language problems, and UTF-8, being compatible with ASCII, is the most common Unicode encoding. But as the information above shows, with UTF-8, GCC (at least in older versions) does not allow a BOM, while VC2005 and later require one, so the same file cannot be compiled by both GCC and VC, and UTF-8 does not seem to be a good choice. With a newer GCC (4.4.0 or later), however, it should be possible to use UTF-8 files with a BOM.

Given our current situation, where we generally work on Simplified Chinese Windows, using GB18030 for source files seems a more realistic choice. VC can compile such files directly, and GCC can support them through the compile option -finput-charset=gb18030. According to Wikipedia, GB18030 is a superset of ASCII and can represent the whole range of Unicode code points, so it has enough expressive power to represent all Unicode characters. The only disadvantage of GB18030 is that a non-Simplified-Chinese version of VC may not recognize a GB18030-encoded source file correctly, because the source file encoding cannot be specified.

3. What encoding should be used inside the program.

As mentioned earlier, C++ has narrow characters (char) and wide characters (wchar_t), each with a corresponding set of classes and functions (string/cout/strlen and wstring/wcout/wcslen, etc.). The former have different default encodings in different compilers (GB18030 for Simplified Chinese VC, UTF-8 for GCC). The latter generally use Unicode: VC uses UTF-16, and GCC uses UTF-32 by default.
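
The difference between the two families can be seen by measuring the same two-character word both ways. This is a small sketch with made-up names; the narrow string is assumed to hold UTF-8, and escapes are used so that the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// The word U+4E2D U+6587 ("Chinese"), once as UTF-8 bytes in a narrow
// string and once as code points in a wide string.
const char kNarrow[] = "\xE4\xB8\xAD\xE6\x96\x87";
const wchar_t kWide[] = L"\u4e2d\u6587";

// strlen counts char units (bytes), not characters.
std::size_t narrow_units() { return std::strlen(kNarrow); }

// wcslen counts wchar_t units. Both characters lie in the Basic
// Multilingual Plane, so each takes one unit in UTF-16 and in UTF-32.
std::size_t wide_units() { return std::wcslen(kWide); }

// Typically 2 where wchar_t holds UTF-16 (VC) and 4 where it holds
// UTF-32 (GCC on Linux).
std::size_t wide_unit_size() { return sizeof(wchar_t); }
```

narrow_units() reports 6 while wide_units() reports 2, which is why code that counts or indexes characters is simpler with wide characters.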

C++ outputs narrow characters as they are, in the program's internal encoding, without any encoding conversion. A program that uses narrow characters therefore requires the program's internal encoding to match the runtime environment encoding; otherwise the output is garbled. Because programs built with Simplified Chinese VC use GB18030 internally, VC programs that use narrow characters can only run in a GB18030 environment. Similarly, because GCC uses UTF-8 as the default internal encoding, GCC programs that use narrow characters can only run in a UTF-8 terminal environment. (This applies to programs that write Chinese and other non-ASCII characters directly in the source code. With gettext and similar tools mentioned earlier, programs that use narrow characters can also output Chinese correctly in runtime environments with different encodings.)

When outputting wide characters, C++ automatically converts them to the encoding of the runtime environment, so as long as the runtime environment encoding is set correctly, the same program can display Chinese correctly in runtime environments with different encodings. This is much like Java/.NET: Java/.NET uses Unicode for its string types and converts to and from the encoding of the current runtime environment on input and output.
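
The wide-character path can be sketched roughly as follows. The function names are invented for this example, and falling back to the classic "C" locale when the environment's locale is unavailable is a choice made here, not something the article prescribes.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Adopts the locale named by the user's environment, so that wide output
// is converted to the terminal's encoding. Falls back to the classic "C"
// locale if the environment names a locale that is not installed.
std::locale use_environment_locale() {
    std::locale loc = std::locale::classic();
    try {
        loc = std::locale("");
    } catch (const std::runtime_error&) {
        // Keep the classic locale.
    }
    std::locale::global(loc);  // affects streams created from now on
    std::wcout.imbue(loc);     // wcout already exists, so set it explicitly
    return loc;
}

// Writes U+4E2D U+6587; the stream converts it to the locale's encoding.
void print_greeting() {
    std::wcout << L"\u4e2d\u6587" << std::endl;
}
```

Calling use_environment_locale() once at startup and then print_greeting() should show the two characters on a terminal whose locale can represent them. In the plain "C" locale the write may fail instead and leave wcout in an error state.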

In general, if you need to support multiple languages, there are two good approaches:

Use narrow characters, but only ASCII characters in the source program. Non-ASCII characters are placed in separate files through gettext or other tools, and tools such as gettext handle the encoding conversion.

Chinese can be output correctly in runtime environments with various encodings.

Non-ASCII characters cannot appear directly in the program, and they cannot be written as \uXXXX escapes either, because the compiler converts the latter into non-ASCII characters and stores them in the object file.

An ASCII-compatible encoding can be used in comments without affecting the compiler.

There is more ready-made code available for reuse.

Use wide characters.

Chinese can be output correctly in runtime environments with various encodings.

Non-ASCII characters can be used in the program.

This must be combined with the source file encoding settings described earlier, so that the compiler can correctly identify the non-ASCII characters in the source program.

There is less code available for reuse, because fewer programs have used wide characters in the past.

Note

If the program needs string constants in a fixed character encoding, such as GB18030-encoded string constants, store the GB18030-encoded bytes in \xHH form. Such content is converted neither to the program's internal encoding nor to the runtime environment encoding.
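
A constant of this kind might look like the following sketch. The constant name is made up for this example; the four bytes are the GB18030 encoding of U+4E2D U+6587.

```cpp
// The word U+4E2D U+6587 ("Chinese") as GB18030 bytes, spelled with \x
// escapes. The compiler stores exactly these four bytes plus a NUL,
// whatever the source file encoding or -fexec-charset setting is.
const char kZhongwenGb18030[] = "\xD6\xD0\xCE\xC4";
```

Such a constant is suited to data that must stay GB18030 on the wire or on disk. Printing it on a UTF-8 terminal would still show garbled text.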

4. What character encoding should be used for the runtime environment.

As mentioned above, programs that use narrow characters and programs that use wide characters place different requirements on the character encoding of the runtime environment.

With wide characters, as long as the program sets the character encoding of the current environment correctly (generally through locale::global(locale(""))), the C++ standard library converts character encodings on input and output, so the program can adapt to runtime environments with various encodings.

With narrow characters and no non-ASCII characters in the program, there is no special requirement on the runtime environment, and the program can adapt to runtime environments with various encodings.

With narrow characters and Chinese or other non-ASCII characters written directly in the program, the C++ standard library outputs the strings stored in the object file (in the program's internal encoding) directly, without any character encoding conversion, so the runtime environment encoding must match the program's internal encoding. Programs compiled by Simplified Chinese VC can only run in a GB18030 environment, and programs compiled by GCC can only run in a UTF-8 environment (this can be changed at compile time with the -fexec-charset option).

5. C++ source file encoding options

5.1. Several possible approaches

Based on the discussion above, there are several ways to stay compatible with both Windows/Linux and VC/GCC:

Using narrow characters, with only ASCII characters in the source program; non-ASCII characters such as Chinese are placed in separate language packs through gettext and similar tools.

This is the approach most often recommended.

Compatible with all VC and GCC versions.

Because non-ASCII characters do not appear in the source program, you do not need to consider the encoding of the source files.

Compatible with runtime environments with various encodings.

Using narrow characters, with non-ASCII characters allowed in the source program.

Requires the runtime environment encoding to match the program's internal encoding, that is, the program can only run on GB18030-encoded Windows and UTF-8-encoded Linux.

Compiler compatibility depends on the encoding used by the source program:

Using narrow characters, with UTF-8 source files that have a BOM.

Compatible with all language versions of VC.

Compatible with GCC 4.4.0 and later.

Using narrow characters, with GB18030-encoded source files.

Compatible with the Simplified Chinese version of VC.

Compatible with all GCC versions, but you need to add the -finput-charset=gb18030 option at compile time.

Using wide characters, with non-ASCII characters allowed in the source program.

Compatible with runtime environments with various encodings.

Compiler compatibility depends on the encoding used by the source program:

Using wide characters, with UTF-8 source files that have a BOM.

Compatible with all language versions of VC.

Compatible with GCC 4.4.0 and later.

Using wide characters, with GB18030-encoded source files.

Compatible with the Simplified Chinese version of VC.

Compatible with all GCC versions, but you need to add the -finput-charset=gb18030 option at compile time.

5.2. Recommended Practice

Given our current situation, for programs that need to support multiple languages, we recommend using narrow characters with only ASCII characters in the source program.

For programs that do not need to support multiple languages, consider using narrow characters with GB18030-encoded source files so that existing code can be reused. Such programs can only run in GB18030-encoded Windows environments and UTF-8-encoded Linux environments.

6. Other questions

6.1. User input, output and persistence

User input and output, and data read from or written to files, networks, and other facilities, all appear as byte streams at the lowest level of the program. This raises the question of how to interpret these byte streams as meaningful information on input, and how to convert the information in the program into the correct byte stream on output.

If the program does not need to process the data and merely moves it from one place to another (for example, saving user input to a file, or reading from one stream and writing to another), and the input character encoding is the same as the output character encoding, the program does not need to do any encoding conversion. It just writes the data it read as it is. In this case the character encoding of the data is independent of the program's internal encoding.

For example, a website application only needs to ensure that the user-facing pages use UTF-8 and that the database and data files also use UTF-8. Data entered by the user can then be written directly to the database and data files, and data read from the database or data files can be shown directly to the user, with no encoding conversion.
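
The pass-through case can be sketched as a plain byte copy. The function names below are invented for this example, and nothing in the code looks at the encoding of the data.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies every byte from in to out unchanged. No encoding is consulted,
// so the data keeps whatever encoding it arrived in.
void copy_bytes(std::istream& in, std::ostream& out) {
    char buffer[4096];
    // read() fails on the final short chunk, but gcount() still reports
    // how many bytes were placed in the buffer.
    while (in.read(buffer, sizeof(buffer)) || in.gcount() > 0) {
        out.write(buffer, in.gcount());
    }
}

// Convenience wrapper for trying the copy on in-memory data.
std::string pass_through(const std::string& bytes) {
    std::istringstream in(bytes);
    std::ostringstream out;
    copy_bytes(in, out);
    return out.str();
}
```

A string mixing UTF-8 and GB18030 bytes comes out of pass_through() byte for byte identical. With file streams, opening them with std::ios::binary avoids newline translation on Windows.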

If the program needs to process the data in some way (for example, counting characters, comparing characters, or appending or removing content from a string), it should first convert the data to a well-defined character encoding, generally the program's internal encoding, then process it, and finally convert the result to the required character encoding for output.

For a wide-character program that only needs to process data in the character encoding of the current runtime environment, ios::imbue() can be used to specify the character encoding for an IO stream, and the C++ standard library then converts automatically between the specified character encoding and the program's internal encoding on input and output. If you do not use streams, you can use the standard wcstombs() or mbstowcs() functions to convert between the current encoding (specified through locale::global() or setlocale()) and wide characters.
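
A minimal sketch of the non-stream route is shown below. The wrapper names to_multibyte and to_wide are hypothetical; only wcstombs(), mbstowcs() and MB_CUR_MAX come from the standard library. Both functions convert relative to whatever C locale is current when they are called.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Converts a wide string to the multibyte encoding of the current C locale.
std::string to_multibyte(const std::wstring& wide) {
    // MB_CUR_MAX is the most bytes one character can need in this locale.
    std::vector<char> buffer(wide.size() * MB_CUR_MAX + 1);
    std::size_t written =
        std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("character not representable in current locale");
    }
    return std::string(buffer.data(), written);
}

// Converts a multibyte string in the current C locale's encoding to wide
// characters.
std::wstring to_wide(const std::string& narrow) {
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buffer(narrow.size() + 1);
    std::size_t written =
        std::mbstowcs(buffer.data(), narrow.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for current locale");
    }
    return std::wstring(buffer.data(), written);
}
```

In the default "C" locale only ASCII is guaranteed to convert. Converting Chinese text requires selecting a suitable locale first, for example with setlocale(LC_ALL, "").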

For narrow-character programs, if the character encoding of the data matches the program's internal encoding, no conversion is needed and the data can be processed directly.

For other scenarios, you need to introduce iconv or a similar character encoding conversion library to convert between different character encodings.

6.2. Alternatives to gettext and iconv

Because gettext and iconv both belong to the GNU Project, licensing considerations mean these libraries are not necessarily suitable for every program, commercial programs in particular. Boost 1.48.0 was the first release to include the Boost.Locale library. It provides the features of gettext and iconv and builds on them, offering case conversion, collation, time handling, word segmentation, formatted input/output of numbers, message formatting, multi-language support, and character encoding conversion. It is worth further study and use.

