Summary of Multilingual Solutions for C++ Software Development

Source: Internet
Author: User
Tags: ultraedit

This article briefly summarizes the following approaches:
    • Unicode as the core
    • Using GNU gettext
    • Qt-based multilingual development tool: Qt Linguist

Unicode as the core

Reference: http://www.ibm.com/developerworks/cn/linux/l-cn-ccppglb/

Because many languages coexist, programmers spend a great deal of time and effort on encoding, yet garbled-text problems such as malformed XML, corrupted text display, and parser failures remain common. Compared with Java, C++ makes encoding problems harder to handle. This article sets aside the detailed similarities and differences among encoding formats. Taking Unicode as the core and Simplified Chinese as the example, it analyzes the causes of encoding problems from an engineering point of view, presents a solution based on the standard library, and draws on project experience to summarize a general approach to handling multilingual encodings.

Statement of the problem

The coexistence of many languages, and of operating systems localized for different languages, complicates multilingual design and adds considerable encoding work. Every so-called encoding problem boils down to one question: under which encoding format is a given sequence of bytes being interpreted? This matters most when data moves between files on disk and memory, that is, during reading and writing: if the wrong encoding format is assumed, the text becomes garbled. Java gives the programmer fairly direct and convenient interfaces for strings and encodings, so programmers used to Java are often confused when they perform text-encoding operations in C++. This article analyzes how different encodings relate to one another in practice, using three encodings as examples: UCS-2, GB2312, and UTF-8.

Common encoding problems
    • 1. A string held in memory in encoding A is converted and written to a file as a byte stream in encoding B.
    • 2. A file written in encoding A is read into memory as a byte stream and parsed as a string in encoding B.

In the first case, the data itself can be altered and corrupted.
In Java this error is somewhat less common because Java has no wstring concept: the in-memory String is always Unicode, and the conversion is transparent to the programmer. You only need to pay attention to the character set chosen for the byte stream whenever you use input/output methods.
For example, a string containing the Chinese word for "standard", encoded in GB2312, is read into memory and written back out as UTF-8:

Figure 1. File encoding conversion as handled in Java

In C++ programming, however, char and string are the types used most often, and file reading and writing in particular operates almost entirely on char* data. C++ also has nothing like Java's getBytes(String charsetName), so a string cannot be re-encoded into a byte array simply by naming a character set. The strings we handle are therefore generally not Unicode but bytes that conform to some particular encoding table, which is why string encoding problems are so often confusing. Suppose there is a UTF-8 string "一" (E4 B8 80) and we mistakenly treat it as GB2312 (encoding A) and convert it to UTF-8 (encoding B). The result is destructive: the wrong output can never be recognized correctly.
Taking "standard" as the example again, this is a correct conversion:

Figure 2. File encoding conversion as handled in C++


The second case is the more common one. Examples: a web page appears garbled in the browser, or an XML file declares <?xml version="1.0" encoding="UTF-8"?> but actually contains GB2312 strings. The latter often makes the XML file malformed, so the parser reports an error.
In this case the data itself is still correct. The problem is solved once the browser selects the correct encoding, or once the GB2312 content of the XML file is converted to UTF-8 or the encoding declaration is changed to match.
Note that ASCII (single-byte) characters are generally unaffected by such changes, because their values are the same in ASCII-compatible encoding tables such as GB2312 and UTF-8. Multibyte characters, such as Chinese, must be handled with care.
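The point about single-byte characters can be checked mechanically. The following is a minimal sketch, not part of the original article; the function name isPureAscii is made up for illustration. It reports whether a byte string contains only values below 0x80, in which case reinterpreting it between GB2312 and UTF-8 cannot change its meaning.

```cpp
#include <string>

// Hypothetical helper: returns true when every byte is below 0x80, i.e. the
// text is plain ASCII and has the same byte values in GB2312 and UTF-8.
bool isPureAscii(const std::string& bytes) {
    for (unsigned char c : bytes) {
        if (c >= 0x80) {
            return false;  // a lead or trail byte of a multibyte character
        }
    }
    return true;
}
```

A string that fails this check is the kind that needs its encoding identified before it is parsed or converted.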

Encoding Conversion method

For general encoding conversion, a direct mapping between two encodings is usually impractical because it requires a great deal of work, so in most cases Unicode is chosen as the intermediary.

Using Library functions

As mentioned earlier, a Java String object is Unicode, so a Java programmer mainly needs to make sure the encoding of the byte stream being read is identified, so that it is converted to Unicode correctly. In C/C++, by contrast, data read from an external file is treated as a character array or a string, whereas wstring is an array of wide characters holding Unicode (two bytes per element on Windows). The commonly used tools are the wcstombs and mbstowcs functions of the C standard library and the MultiByteToWideChar and WideCharToMultiByte functions of the Windows API, which convert to and from Unicode.

Here, an implementation of mbs2wcs illustrates the main steps of converting GB2312 to Unicode:

Listing 1. Converting a multibyte string to a wide-character string
/* _linux_ is the article's own macro; GCC itself predefines __linux__. */
wchar_t* mbs2wcs(const char* pszSrc) {
    wchar_t* pwcs = NULL;
    int size = 0;
#if defined(_linux_)
    setlocale(LC_ALL, "zh_CN.GB2312");
    /* first call: ask for the required length (terminator not included) */
    size = (int)mbstowcs(NULL, pszSrc, 0);
    if (size < 0)
        return NULL;            /* invalid multibyte sequence */
    pwcs = new wchar_t[size + 1];
    size = (int)mbstowcs(pwcs, pszSrc, size + 1);
    pwcs[size] = 0;
#else
    /* 20936 is the Windows code page for GB2312 */
    size = MultiByteToWideChar(20936, 0, pszSrc, -1, 0, 0);
    if (size <= 0)
        return NULL;
    pwcs = new wchar_t[size];
    MultiByteToWideChar(20936, 0, pszSrc, -1, pwcs, size);
#endif
    return pwcs;
}

Correspondingly, wcs2mbs converts a wide-character string into a byte stream.

Listing 2. Converting a wide-character string to a multibyte string
char* wcs2mbs(const wchar_t* wcharStr) {
    char* str = NULL;
    int size = 0;
#if defined(_linux_)
    setlocale(LC_ALL, "zh_CN.UTF8");
    /* first call: ask for the required length (terminator not included) */
    size = (int)wcstombs(NULL, wcharStr, 0);
    if (size < 0)
        return NULL;            /* character not representable */
    str = new char[size + 1];
    wcstombs(str, wcharStr, size);
    str[size] = '\0';
#else
    size = WideCharToMultiByte(CP_UTF8, 0, wcharStr, -1, NULL, 0, NULL, NULL);
    if (size <= 0)
        return NULL;
    str = new char[size];
    WideCharToMultiByte(CP_UTF8, 0, wcharStr, -1, str, size, NULL, NULL);
#endif
    return str;
}

Details of setlocale on Linux can be found in the C/C++ standard library documentation; it affects many locale-dependent formats such as text, currency units, and time. In the Windows code, 20936 and the macro CP_UTF8 are the code pages for GB2312 and UTF-8 respectively [similar code page parameters can be found in the Encoding class documentation on MSDN].

What needs emphasis here is the second parameter of setlocale, which differs between Linux and Windows:

    • 1. Under Eclipse CDT + MinGW, the author could not get the conversion test to pass with the [country].[charset] format (such as zh_CN.GB2312 or zh_CN.UTF8), but a code page works, written as setlocale(LC_ALL, ".20936"). This shows that the parameter depends not on the compiler but on the system's definitions, and different operating systems define their installed character sets differently.
    • 2. On Linux, the locales supported by the system can be found under /usr/lib/locale/. When converting to UTF-8, the [country] part does not have to be zh_CN; en_US.UTF8 also converts correctly.
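Because the accepted locale names depend on the operating system, it can help to probe them at run time. The sketch below is an illustration only and not from the original article; firstAvailableLocale is a hypothetical helper. It tries each candidate name with setlocale, reports the first one the system accepts, and restores the previous locale before returning.

```cpp
#include <clocale>
#include <string>
#include <vector>

// Hypothetical helper: returns the first name in `candidates` that
// setlocale() accepts, or an empty string if none is accepted.
std::string firstAvailableLocale(const std::vector<std::string>& candidates) {
    // Remember the current locale so the probe has no lasting effect.
    const char* current = std::setlocale(LC_ALL, nullptr);
    std::string saved = current ? current : "C";

    std::string found;
    for (const std::string& name : candidates) {
        if (std::setlocale(LC_ALL, name.c_str()) != nullptr) {
            found = name;
            break;
        }
    }
    std::setlocale(LC_ALL, saved.c_str());
    return found;
}
```

A caller could pass, for example, the names mentioned above ("zh_CN.GB2312" and ".20936") and use whichever one is returned. Note that some C libraries accept almost any name, so a non-empty result does not by itself guarantee that the conversion tables are installed.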

In addition, standard C and the Win32 API differ in their return values. The length returned by the standard C functions does not include the string terminator, so the terminator has to be written manually, which is why the Linux branch of the code is handled differently.

Finally, remember that the memory allocated by these two functions must be freed after they are called. If you change the return types of mbs2wcs and wcs2mbs to wstring and string respectively, the delete can be done inside the function body. It is omitted here for brevity, but readers should not forget it.
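The suggestion of returning wstring and string can be sketched as follows. This is a hedged illustration rather than the article's code: the names toWide and toNarrow are made up, only the standard C functions are used (no Windows branch), and the conversion follows whatever locale is currently set with setlocale. Because the buffers are owned by std::vector, nothing has to be freed by the caller.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Sketch: multibyte string (in the current C locale's encoding) -> std::wstring.
// Returns an empty string if the input contains an invalid sequence.
std::wstring toWide(const std::string& src) {
    // With a null destination, glibc and MSVC report the required length.
    std::size_t len = std::mbstowcs(nullptr, src.c_str(), 0);
    if (len == static_cast<std::size_t>(-1)) {
        return std::wstring();
    }
    std::vector<wchar_t> buf(len + 1);
    std::mbstowcs(buf.data(), src.c_str(), len + 1);
    return std::wstring(buf.data(), len);
}

// Sketch: std::wstring -> multibyte string in the current C locale's encoding.
std::string toNarrow(const std::wstring& src) {
    std::size_t len = std::wcstombs(nullptr, src.c_str(), 0);
    if (len == static_cast<std::size_t>(-1)) {
        return std::string();
    }
    std::vector<char> buf(len + 1);
    std::wcstombs(buf.data(), src.c_str(), len + 1);
    return std::string(buf.data(), len);
}
```

To reproduce the GB2312 to UTF-8 conversion of the listings, a caller would set the GB2312 locale before toWide and the UTF-8 locale before toNarrow.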

Third-party libraries

Third-party tools are now quite mature. Two are listed here; since they are not the focus of this article, they are not discussed in depth.

    • On Linux there is the third-party iconv project, which is also simple to use; in essence it too uses Unicode as the intermediary for conversion. See the iconv-related websites.
    • ICU is a mature internationalization toolkit. Its code page conversion feature supports bidirectional conversion of text data between any character set and Unicode. See its website.
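As a rough illustration of the iconv approach, the sketch below wraps the three POSIX calls iconv_open, iconv, and iconv_close. The wrapper name convertEncoding and its buffer size are assumptions made for this example, and the encoding names that are actually available (for instance "GB2312") depend on the iconv implementation installed on the system.

```cpp
#include <iconv.h>
#include <cerrno>
#include <string>

// Sketch: convert `input` from encoding `from` to encoding `to` with iconv.
// Returns false if the encodings are unknown or the input is invalid.
bool convertEncoding(const std::string& input, const char* from,
                     const char* to, std::string& output) {
    iconv_t cd = iconv_open(to, from);  // note the order: target first
    if (cd == (iconv_t)-1) {
        return false;
    }
    output.clear();
    char* in = const_cast<char*>(input.data());
    size_t inLeft = input.size();
    char buffer[256];
    bool ok = true;
    while (inLeft > 0) {
        char* out = buffer;
        size_t outLeft = sizeof(buffer);
        size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        output.append(buffer, sizeof(buffer) - outLeft);
        if (rc == (size_t)-1 && errno != E2BIG) {
            ok = false;  // EILSEQ or EINVAL: bad or truncated input
            break;       // (E2BIG only means the buffer was full)
        }
    }
    iconv_close(cd);
    return ok;
}
```

For the example in this article the call would be convertEncoding(bytes, "GB2312", "UTF-8", result). Stateful encodings would additionally need a final flushing call, which this sketch leaves out.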

Experimental test

In your code, call the functions described in the "Encoding Conversion method" section to convert a GB2312-encoded string to UTF-8, and examine how the encoding is transformed.

In an English Linux environment, execute the following command:

export LC_ALL=zh_CN.GB2312

Then compile and run the following program (the Chinese characters are written into the source file in the GB2312 environment):

L1:     wstring ws = L"一";
L2:     string s_gb2312 = "一";
L3:     wchar_t* wcs = mbs2wcs(s_gb2312.c_str());
L4:     char* cs = wcs2mbs(wcs);

Examining the output:

    • L1 - 1 wide char: 0x04BB
    • L2 - 2 bytes: 0xD2, 0xBB, i.e. GB2312 code 0xD2BB
    • L3 - the returned wchar_t array contains 0x4E00, the Unicode code point
    • L4 - Unicode is converted again, to UTF-8; the output is 3 bytes long: 0xE4, 0xB8, 0x80

At line L1 the result is 0x04BB, which is in fact a conversion error; with some other Chinese characters, such as "ha", the code does not even compile. In other words, declaring a Chinese wide string literal directly is unreliable in the Linux environment, because the compiler cannot convert it correctly.

With the same test code on Chinese Windows, L1 behaves differently: the wchar_t element in ws has the hexadecimal value 0x4E00, which is the Unicode code point of the Chinese character "一".
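One way to sidestep the dependence on the source file's encoding, which the original article does not cover, is to spell the character as a universal character name. In the minimal sketch below (the function name wideOne is hypothetical), the compiler is given the code point U+4E00 directly and does not have to decode any GB2312 or UTF-8 bytes in the source file.

```cpp
#include <string>

// \u4e00 names the code point directly, so the result does not depend on
// the encoding in which this source file was saved.
std::wstring wideOne() {
    return L"\u4e00";
}
```

The trade-off is readability: the literal no longer shows the character itself, so a comment naming the character is usually worth adding.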

Lessons learned in dealing with encoding problems

First, a brief explanation of the relationship between Unicode and UTF-8: Unicode defines the character codes but not how they are stored as bytes, and UTF-8 is one of its encoding implementations. For example, if you open a UTF-8 encoded Chinese file in UltraEdit and switch to the hexadecimal view, the Chinese text appears as Unicode codes, 2 bytes per character. UltraEdit has performed a conversion here; if you inspect the file's bytes directly, you will find 3 bytes per character. The difference between the two is simply the arithmetic conversion from Unicode to UTF-8. (For more on the concepts of Unicode and UTF-8, refer to the relevant literature.)
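The arithmetic conversion mentioned above can be written out in a few lines. The following sketch is an illustration added here, not code from the original article, and the name encodeUtf8 is made up. It turns one Unicode code point into its UTF-8 bytes; for U+4E00 ("一") it yields the three bytes E4 B8 80 quoted earlier. It does not reject surrogate values or code points above U+10FFFF.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 by distributing its bits over
// 1 to 4 bytes, each continuation byte carrying 6 bits.
std::string encodeUtf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

This is why the same character occupies 2 bytes as a UCS-2 value and 3 bytes in a UTF-8 file.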

Second, the choice of a third-party library should take the project's overall needs into account. For ordinary text conversion, the system library functions are sufficient and simple to use. If you need programming tailored to different regions, languages, and scripts, and therefore richer functionality, a mature third-party tool will achieve more with less effort.

Finally, a few general rules help keep string encoding logically correct:

    • Encoding choice : when programming for a multilingual environment, prefer a UTF encoding so as to reduce character set conversions.
    • A string does not carry encoding information, but the encoding determines the string's binary content.
    • Read-write consistency : use the same character set for reading as for writing. If the string content does not need to change and is simply read and written back out, it is best not to set any character set at all. Even if the system default character set A used by the program does not match the file's actual encoding B, the string written out will still be correct in encoding B.
    • Known encoding on input : for strings that must be processed, parsed, or displayed, the encoding must be known from the moment the file is read. Avoid string-handling code that simply assumes the system default character set. Even for in-memory strings that the program collects from the system, you should know which encoding they conform to, typically the system default character set.
    • Avoid using Unicode values directly : that is, writing non-ASCII characters as hexadecimal or decimal values with the "&#" prefix, for example writing the Chinese "一" as "&#x4e00;". In essence this writes the Unicode code directly into the file. It reduces the generality of the code and the readability of the output file, and it is also hard to process. For example, accented French characters are single-byte values above 80H in some character sets, and a program that also supports Chinese is likely to misread them as part of a multibyte character.
    • Avoid programming directly against character sets : internationalization and localization tools are mature, so programmers whose job is not pure encoding conversion need not deal with mapping between encoding tables.
    • Unicode/UTF-8 does not solve every garbled-text problem : Unicode is a set of codes that unifies the world's languages, but this does not mean that a UTF-8 file displayed correctly on one system will display correctly on another. For example, Chinese text in UTF-8 or Unicode is still unreadable on a French system without East Asian language pack support, even though both UTF-8 and Unicode encode it.

Using GNU gettext

Reference: Http://zh.wikipedia.org/wiki/Gettext

gettext is the GNU internationalization and localization (i18n) library. It is often used to write multilingual programs.

Development

The program source code needs to be modified to make GNU gettext calls. Most programming languages support this by wrapping strings in a function call. To reduce typing and code clutter, the function is usually used through the alias _, so for example the following C code:

printf(gettext("My name is %s.\n"), my_name);

would be written as:

printf(_("My name is %s.\n"), my_name);

gettext uses the string as a key to look up the translation in the target language, and returns the original string if no translation is available.
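The fallback behaviour can be seen in a small sketch. It assumes a system where the gettext function is declared in libintl.h (as on glibc-based Linux), and greetingTemplate is a hypothetical name used only for this illustration. With no message catalog bound, the call simply returns the English text it was given.

```cpp
#include <libintl.h>
#include <string>

#define _(s) gettext(s)

// Returns the translated greeting template, or the original English text
// when no catalog is available for the current locale.
std::string greetingTemplate() {
    return _("My name is %s.\n");
}
```

Because the original string doubles as the lookup key, a program with missing or partial translations still produces readable output.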

In addition to C, GNU gettext also supports C++, Objective-C, Pascal/Object Pascal, sh scripts, bash scripts, Python, GNU CLISP, Emacs Lisp, librep, GNU Smalltalk, Java, GNU awk, wxWidgets (through the wxLocale class), YCP (the YaST2 language), Tcl, Perl, PHP, Pike, Ruby, and R. The usage is similar to that in C.

The xgettext program generates a .pot file from the source code, which serves as the template for translating the strings in the source. A typical .pot entry looks like this:

msgid "My name is %s.\n"
msgstr ""

Comments placed directly before a string help the translator understand what is being translated:

/// Thank you for contributing to this project.
printf(_("My name is %s.\n"), my_name);

The comments in this example start with /// and are picked up into the .pot template file when xgettext is run with:

xgettext --add-comments=///

The comments in the .pot file then take the following form:

#. Thank you for contributing to this project.
msgid "My name is %s.\n"
msgstr ""

Translation

The translator works with a .po file, which the msginit program generates from the .pot template. For example, to initialize a French translation file with msginit, run the following command:

msginit --locale=fr --input=name.pot

This creates fr.po in the current directory from the specified name.pot. One of its entries should look like this:

msgid "My name is %s.\n"
msgstr ""

The translator then edits the file by hand or with a tool such as Poedit, gtranslator, or the corresponding Emacs mode. After translation, the entry should look like this:

msgid "My name is %s.\n"
msgstr "Je m'appelle %s.\n"

Finally, the .po files are compiled with msgfmt into .mo files for distribution.

Run

Users of Unix-like operating systems only need to set the environment variable LC_MESSAGES, and the program automatically reads the language data from the appropriate .mo file.

Supplement: gettext-0.18.3.2, the latest version at the time of writing, can be used for multilingual support in MSVC

Reference: http://www.aslike.net/showart.asp?id=154

"Usually, programs and their documentation are written in English, and the messages a program shows while running are also in English. This is true not only of GNU software but also of most other proprietary or free software. On the one hand, a common language makes it very convenient for developers, maintainers, and users from all countries to communicate with each other. On the other hand, most people are less comfortable with English than with their mother tongue, and prefer to use their native language in daily work as much as possible. Most people would like their computer screen to show less English and more of their native language."

"GNU gettext is an important step for the GNU Translation Project, and many other steps depend on it. The package provides programmers, translators, and users with an integrated set of tools and documentation. Specifically, GNU gettext provides a set of tools that enable other GNU software to produce multilingual messages. ..."

The gettext workflow is as follows. Suppose we write a Visual C++ (MSVC) program whose output, for example through printf, is normally in Chinese. If we add gettext support and wrap each user-facing string in a call to the gettext function, the program obtains the string for the current language from gettext and uses it in place of the original. Note that the substitution happens at run time.

GNU gettext-0.18.3.2 can be downloaded directly from the official GNU website, but it does not ship a runtime support library usable from Visual C++ (MSVC), so one has to be compiled by hand.

When using GNU gettext for multilingual support in Visual C++ (MSVC), you can write a translation function that replaces interface and menu strings automatically; strings inside the program code can only be replaced by hand. Used this way, it is almost as quick and easy as using GNU gettext in Delphi and C++Builder.

Examples of simple use

A simple example:

#include <stdio.h>
#include <locale.h>
#include <libgnuintl.h>

/* gettext is usually used through a macro like the one below,
 * although you can also call gettext(string) directly.
 */
#define _(s) gettext(s)

/* PACKAGE is the text domain: the base name of the .mo file from which
 * the translated strings are loaded.
 */
#define PACKAGE "default"

int main(int argc, char **argv)
{
    /* The following three calls are required when using gettext:
     * setlocale
     * bindtextdomain
     * textdomain
     */
    setlocale(LC_ALL, "");
    bindtextdomain(PACKAGE, "locale");
    textdomain(PACKAGE);

    printf(_("hello,gettext!\n"));

    return 0;
}

Layout of the language string files: .\locale\<language name>\LC_MESSAGES\default.mo, for example for Simplified Chinese: .\locale\zh_CN\LC_MESSAGES\default.mo

.mo files are compiled language string files; the GNU website provides tools for editing and generating them.


Qt-based multilingual development tool: Qt Linguist

Reference: Http://www.oschina.net/p/qt+linguist

http://www.oschina.net/question/54100_146029

http://www.oschina.net/question/54100_146030

http://devbean.blog.51cto.com/448512/244689

http://devbean.blog.51cto.com/448512/245063

Qt Linguist is a tool for adding multilingual support to applications written in Qt.

Qt Linguist is mainly used during the translation stage of a project, so this section first gives a brief overview of the whole multilingual workflow and then introduces the use of Linguist.

(i) To make a Qt project multilingual, two things must be done:

1) Make sure every user-visible string is wrapped in the tr() function.
2) When the application starts, use QTranslator to load a translation file (.qm).
Usage of tr():

caseCheckBox = new QCheckBox(tr("Match &case"));

Load the translation file in the main () function:

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);
    // Set up the translator
    QTranslator translator;
    translator.load("spreadsheet_cn.qm");
    app.installTranslator(&translator);
    ……
}

Note: the translation file must be loaded before the user interface is instantiated.

(ii) Generate the .qm translation files

1. Add TRANSLATIONS entries to the application's .pro file, one per language, for example spreadsheet_cn.ts for Chinese. The base name can be chosen freely, but the .ts suffix must not be changed. (A .ts file is a human-readable translation file in a simple XML format; a .qm file is the binary form generated from a .ts file.)

2. Translate the files. This takes three steps:
1) Run lupdate to extract all user-visible strings from the application's source code.
2) Use Qt Linguist to translate the application.
3) Run lrelease to generate the binary .qm file.
All three steps are run from the command-line console that ships with Qt. To start it: Start ---> All Programs ---> Qt by Nokia v4.6.3 (OpenSource) ---> Qt 4.6.3 Command Prompt
After starting the command line, enter the following commands:
1) lupdate -verbose spreadsheet.pro // generates the corresponding .ts file
2) linguist // starts the Linguist translation tool, in which the visible strings can be translated
3) lrelease -verbose spreadsheet.pro // converts the translated files into .qm files

(iii) Using the Linguist tool

1) Start: from the command line or from the Start menu.
2) Open: choose File ---> Open in the tool and open the required .ts file.
3) Translate: the translation pane in the middle of the window has two rows. The first row is Source Text and the second is ... Translation. Enter the translation in the second row and click the "OK Next" button when done.
4) Publish: choose File ---> Release to generate the .qm file (the same result as the command line).

(iv) Recommendations for using the Linguist tool

1. In the code, temporarily use English wherever Chinese text is needed, and mark it with the tr() function.

2. Use Qt Linguist to translate all strings marked with tr() and publish the translation package.

3. Load the translation package in the program.

Detailed practices can be found in devbean's blog:

Qt Learning Path (33): Internationalization (Part 1): http://devbean.blog.51cto.com/448512/244689

Qt Learning Path (34): Internationalization (Part 2): http://devbean.blog.51cto.com/448512/245063
