A Brief Analysis of the Linux Internationalization and Localization Mechanism
Original address: http://www.oschina.net/question/12_7648
Linux is an internationalized operating system whose toolkits and device drivers support multiple languages. This article analyzes the functions and command-line tools that implement internationalization and localization in glibc, and looks at the mechanism from the perspectives of program developers, translators, and users, so that readers can better understand and use locales.
What are internationalization and localization?
Because of cultural differences, countries and regions represent things such as dates, times, and currency symbols differently, and the most obvious difference is language. The developers, maintainers, and end users of a piece of software may come from different regions, and it is unreasonable to require them all to use the same language. Therefore, when a program is written for people around the world, the work is usually divided into two parts: internationalization (i18n, so called because there are 18 letters between the initial "i" and the final "n") and localization (l10n, by the same rule).
Internationalization means making a program usable by different groups of users without modifying or recompiling its source code. In ISO C, internationalization relies on locales. Program developers can internationalize their programs in a variety of ways, but GNU gettext has become one of the standard choices.
Localization means supplying a program with the language information of a specific region so that its input and output processing suits the people of that region. Some of the locale environment variables a program uses can be configured dynamically at run time.
Simply put, internationalization is the developer's job and a general process, while localization is the translator's job and a specific process; internationalization is what makes localization possible. The two together are sometimes called NLS (Native Language Support). Glibc (the GNU C Library), as the standard C library on Linux, provides the foundation for internationalization and localization on Linux. Figure 1 shows that programs on Linux depend on glibc.
Figure 1. Basic functions of glibc on Linux
Use and set the system locale
For users, the facility that controls which language or regional environment is in effect is called the locale. The locale is an important part of glibc and a foundation of Linux internationalization and localization. A locale is configured through a series of environment variables that cover the user's internationalization and localization needs. With the locale command we can view the current settings of the language environment, as well as the locale names and character sets available on the system.
Listing 1. Current values of the system's locale environment variables
    $ locale
    LANG=en_US.UTF-8                  # default used when no LC_xxx variable is set
    LC_CTYPE=zh_CN.UTF-8              # character classification and handling
    LC_NUMERIC="en_US.UTF-8"          # non-monetary numeric format
    LC_TIME="en_US.UTF-8"             # date and time format
    LC_COLLATE="en_US.UTF-8"          # collation (sort) rules
    LC_MONETARY="en_US.UTF-8"         # monetary format
    LC_MESSAGES="en_US.UTF-8"         # message format
    LC_PAPER="en_US.UTF-8"            # paper size
    LC_NAME="en_US.UTF-8"             # how personal names are written
    LC_ADDRESS="en_US.UTF-8"          # address format and location information
    LC_TELEPHONE="en_US.UTF-8"        # telephone number format
    LC_MEASUREMENT="en_US.UTF-8"      # measurement system
    LC_IDENTIFICATION="en_US.UTF-8"   # metadata about the locale itself
    LC_ALL=                           # overrides all other LC_xxx variables
In addition to the variables shown in Listing 1, there is the LOCPATH variable, which specifies the directory that holds the available locales. The default path is /usr/lib/locale/. When a program looks up the locale environment variables, it uses them in the following order of priority.
Listing 2. Priority of locale-related variables
    [1] LANGUAGE
    [2] LC_ALL
    [3] LC_xxx
    [4] LANG
LC_ALL is both an environment variable and a category macro accepted by setlocale (the prototype of setlocale and the values of its category parameter are defined in the header file locale.h). When LC_ALL is set and not empty, its value overrides all other locale settings. LANG specifies the default locale for a region, while LANGUAGE specifies a priority-ordered list of an individual's preferred languages for messages. Generally, LANG is set first and then refined by individual LC_xxx variables.
    LANGUAGE="en_US:en"
    LANG="en_US.UTF-8"
    LC_CTYPE="zh_CN.UTF-8"
You can add the lines above to system initialization files such as /etc/profile or /etc/environment so that the system uses the expected language environment right after startup. Note that if the locale is set to "C", the value of LANGUAGE is ignored. Therefore LANG (or LC_ALL) must be set to a valid locale name other than "C" for LANGUAGE to take effect.
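The priority order above can be sketched as a small C helper. This is only an illustration: the function names effective_locale and language_applies are hypothetical and not part of glibc, and the sketch takes the variable values as arguments instead of reading the environment.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the first non-empty value in the order
   LC_ALL, LC_xxx, LANG, and falls back to "C" when none is set. */
const char *effective_locale(const char *lc_all, const char *lc_xxx,
                             const char *lang)
{
    if (lc_all && *lc_all) return lc_all;
    if (lc_xxx && *lc_xxx) return lc_xxx;
    if (lang && *lang) return lang;
    return "C";
}

/* Hypothetical helper: LANGUAGE only matters for message lookup and is
   ignored when the effective locale is "C" or "POSIX". */
int language_applies(const char *language, const char *locale)
{
    if (language == NULL || *language == '\0') return 0;
    return strcmp(locale, "C") != 0 && strcmp(locale, "POSIX") != 0;
}
```

With the settings shown above, effective_locale(getenv("LC_ALL"), getenv("LC_CTYPE"), getenv("LANG")) would yield zh_CN.UTF-8 for character handling, because LC_ALL is empty and LC_CTYPE takes precedence over LANG.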
Listing 3. Available locale names of the current system
    $ locale -a
    C
    en_AU.utf8
    ...
    POSIX
    zh_CN.utf8
    ...
In Listing 3 we see two special locale names, C and POSIX. Currently POSIX is only an alias of C. Apart from C and POSIX, locale names are not standardized. The names in Listing 3 are printed in the order given by the collation rules of LC_COLLATE. We can also see that locale names follow a naming format:
    language[_territory[.codeset]][@modifier]
Here language is a language code defined in the ISO 639-1 standard, territory is a country or region code defined in the ISO 3166-1 standard, codeset is the name of a character set (such as UTF-8), and modifier selects a variant of the locale. If the locale name you want is not in the list above, you can add it with the localedef command provided by glibc (see Listing 4; localedef generates the necessary data files in the relevant path).
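To make the naming format concrete, here is a minimal sketch that splits a locale name into its four parts. The struct locale_parts and the function parse_locale_name are hypothetical names introduced for illustration; glibc parses locale names internally and exposes no such API. The sketch also accepts a codeset without a territory (as in C.UTF-8), which is slightly more permissive than the format above.

```c
#include <string.h>

/* Hypothetical container for the parts of a locale name. */
struct locale_parts {
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[32];
};

/* Copies the bytes in [begin, end) into dst, truncating if necessary. */
static void copy_span(char *dst, size_t dstsize,
                      const char *begin, const char *end)
{
    size_t n = (size_t)(end - begin);
    if (n >= dstsize) n = dstsize - 1;
    memcpy(dst, begin, n);
    dst[n] = '\0';
}

/* Splits language[_territory[.codeset]][@modifier].
   Returns 0 on success, -1 if the language part is empty. */
int parse_locale_name(const char *name, struct locale_parts *out)
{
    const char *end = name + strlen(name);
    const char *at = strchr(name, '@');
    const char *body_end = at ? at : end;          /* part before @modifier */
    const char *dot = memchr(name, '.', (size_t)(body_end - name));
    const char *lt_end = dot ? dot : body_end;     /* end of language_territory */
    const char *us = memchr(name, '_', (size_t)(lt_end - name));
    const char *lang_end = us ? us : lt_end;

    memset(out, 0, sizeof *out);
    if (lang_end == name) return -1;
    copy_span(out->language, sizeof out->language, name, lang_end);
    if (us)  copy_span(out->territory, sizeof out->territory, us + 1, lt_end);
    if (dot) copy_span(out->codeset, sizeof out->codeset, dot + 1, body_end);
    if (at)  copy_span(out->modifier, sizeof out->modifier, at + 1, end);
    return 0;
}
```

For zh_CN.UTF-8 this yields language zh, territory CN, codeset UTF-8, and an empty modifier.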
Listing 4. Add a fi_FI.UTF-8 by running the localedef command
    [1] localedef -f UTF-8 -i fi_FI fi_FI.UTF-8
    [2] localedef -f UTF-8 -i fi_FI ./fi_FI
Method 1 generates a locale-archive file in the default path, and method 2 generates a directory in the specified path that contains the locale data. The localedef command also provides the --no-archive option, which makes method 1 generate a directory rather than the locale-archive file. Next, we set the LC_ALL and LC_TIME values to show the effect of the locale environment variables on the time and date format, which helps in understanding the basic role these variables play in the system (see Listing 5, and make sure that the locale names are valid before running the commands).
Listing 5. Effect of locale environment variables on system commands
    $ LC_ALL=en_US.UTF-8 date
    Thu Nov  5 14:13:36 CST 2009
    $ LC_TIME=fi_FI.UTF-8 date
    to 5.11.2009 14.13.44 +0800
    $ LC_ALL=zh_CN.UTF-8 locale -ck LC_TIME
    abday="日;一;二;三;四;五;六"
    ...
    date_fmt="%Y年 %m月 %d日 %A %H:%M:%S %Z"
    time-codeset="UTF-8"
Note that the implementation of the date command in GNU coreutils contains the following calls, which are the key to its internationalization and localization (see Listing 6).
Listing 6. Source code snippet of the date command
    setlocale (LC_ALL, "");
    bindtextdomain (PACKAGE, LOCALEDIR);
    textdomain (PACKAGE);
Character and Character Set Processing
Processing of characters and character sets is another important part of Linux internationalization and localization. In early computer character sets each character used only six, seven, or eight bits, which is clearly not enough for languages such as those of East Asia, so double-byte and even longer multi-byte character sets appeared. Character set encoding involves two important concepts: internal code and external code. The internal code is the encoding used in computer memory, while the external code is the encoding used outside memory, for example in storage and transmission. Common wide character set internal codes include Unicode and ISO 10646 (also called UCS, the Universal Character Set). Unicode was initially designed as a 16-bit code, while ISO 10646 uses 31 bits. Since their development the two standards have been kept almost identical: the Unicode standard corresponds to ISO 10646 implementation level 3 (that is, support for all UCS combining characters). The commonly used UTF-8 (UCS Transformation Format 8) is a variable-length encoding that is compatible with ASCII and with all POSIX file systems.
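The variable-length nature of UTF-8 and its ASCII compatibility can be illustrated with a short encoder. This is a sketch under one stated assumption: it restricts code points to the current Unicode range (up to U+10FFFF, excluding surrogates) rather than the full 31-bit range of the original ISO 10646 design. The function name utf8_encode is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* ASCII: stored unchanged in one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not valid characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

An ASCII letter such as U+0041 stays a single byte 0x41, while the CJK character U+4E2D becomes the three bytes E4 B8 AD.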
Generally, programs rely on classification functions to handle letters, digits, spaces, and other characters, and these functions are affected by the LC_CTYPE value of the current locale. The ISO C standard describes two character types, char and the wide character (WC) type wchar_t. Their classification functions are declared in the header files ctype.h and wctype.h, and functions for processing multibyte strings (MBS) and wide strings (WCS) are declared in the header file wchar.h. The wide character classification functions are the more general ones, because they allow the set of character classes to be extended beyond the predefined ones; the character class extensions described in POSIX are implemented in the glibc program localedef. In addition, because the locale affects how a program handles characters, glibc provides a more general, locale-independent character set conversion facility, the function and tool iconv (see Listing 7).
Listing 7. iconv Functions
    #include <iconv.h>
    iconv_t iconv_open(const char *tocode, const char *fromcode);
    int iconv_close(iconv_t cd);
    size_t iconv(iconv_t cd, char **inbuf, size_t *inbytesleft,
                 char **outbuf, size_t *outbytesleft);
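The three functions in Listing 7 are typically used together: open a conversion descriptor, convert, then close. The following is a minimal sketch that converts a UTF-8 buffer to UCS-4LE in one call. The wrapper name utf8_to_ucs4le is hypothetical, and the sketch assumes the whole input fits in the output buffer, so it does not handle the E2BIG case by looping.

```c
#include <iconv.h>
#include <stddef.h>

/* Converts inlen bytes of UTF-8 text to UCS-4LE.
   Returns the number of bytes written to out, or (size_t)-1 on failure. */
size_t utf8_to_ucs4le(const char *in, size_t inlen, char *out, size_t outcap)
{
    /* Note the argument order: target encoding first, source second. */
    iconv_t cd = iconv_open("UCS-4LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inptr = (char *)in;   /* iconv() takes char **, but only reads the input */
    char *outptr = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;      /* invalid input or output buffer too small */
    return outcap - outleft;
}
```

Because iconv works on explicitly named encodings, the result does not depend on the current locale, which is the property the paragraph above points out.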
Next, we can look at the /proc file system to observe how the Linux internationalization and localization mechanism uses memory (see Listing 8). The contents of /proc/self/maps printed by the cat command should be consistent with the current locale values reported by the locale command (that is, the locale files mapped in Listing 8 should match the output of locale).
Listing 8. Memory mappings and access permissions of the current process
    $ cat /proc/self/maps
    ...
    085a2000-085c3000 rw-p 085a2000 00:00 0          [heap]
    b7a90000-b7acf000 r--p 00000000 08:08 740190     /usr/lib/locale/zh_CN.utf8/LC_CTYPE
    b7acf000-b7ad0000 r--p 00000000 08:08 729171     /usr/lib/locale/en_US.utf8/LC_NUMERIC
    b7ad0000-b7ad1000 r--p 00000000 08:08 781364     /usr/lib/locale/en_US.utf8/LC_TIME
    b7bbd000-b7dbd000 r--p 00000000 08:08 704987     /usr/lib/locale/locale-archive
    b7dbd000-b7dbe000 rw-p b7dbd000 00:00 0
    b7dbe000-b7f1a000 r-xp 00000000 08:08 852866     /lib/tls/i686/cmov/libc-2.9.so
    ...
    b7f27000-b7f2e000 r--s 00000000 08:08 704989     /usr/lib/gconv/gconv-modules.cache
    ...
    b7f32000-b7f4e000 r-xp 00000000 08:08 827406     /lib/ld-2.9.so
    ...
Listing 8 also shows the locale-archive file mentioned above, which is equivalent to the directory files that contain locale data (such as /usr/lib/locale/en_US.utf8/LC_TIME). The file gconv-modules.cache is the iconv configuration cache that the iconvconfig command generates from the file gconv-modules; its absence does not prevent iconv from working, because when gconv-modules.cache does not exist, iconv tries to open the configuration file gconv-modules instead.
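The observation in Listing 8 can be repeated from inside a program. The sketch below scans /proc/self/maps and counts the mappings whose path contains "/locale/". Both function names are hypothetical, and the six-field line layout is taken from the listing above; a path containing spaces would be cut at the first space, which is acceptable for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if a maps line refers to a file whose path contains
   "/locale/", and 0 otherwise (including lines without a path). */
int is_locale_mapping(const char *line)
{
    char path[256];
    /* Fields: address range, permissions, offset, device, inode, pathname. */
    if (sscanf(line, "%*s %*s %*s %*s %*s %255s", path) != 1)
        return 0;
    return strstr(path, "/locale/") != NULL;
}

/* Counts locale-related mappings of the calling process.
   Returns -1 if /proc/self/maps cannot be opened. */
int count_locale_mappings(void)
{
    char line[512];
    int count = 0;
    FILE *fp = fopen("/proc/self/maps", "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL)
        count += is_locale_mapping(line);
    fclose(fp);
    return count;
}
```

Calling count_locale_mappings() before and after setlocale (LC_ALL, "") in a non-C locale is one way to see the locale data being mapped.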
Brief Introduction to gettext
GNU gettext is designed for program internationalization and localization. It is an important part of the GNU Translation Project and provides a framework with which program developers, translators, and even users can produce multilingual messages (through the interfaces gettext provides, a program can follow the system's language environment). gettext therefore helps us internationalize and localize programs quickly. The gettext toolkit, which also helps us manage and maintain translations, includes:
- A set of conventions on how to write programs so that they support message catalogs;
- A directory and file naming organization for the message catalogs;
- A runtime library that supports retrieving translated messages;
- Stand-alone programs for managing translatable and already translated messages;
- A library that supports parsing and creating files that contain translated messages;
- A special mode for Emacs that helps prepare these message sets and bring them up to date.
Figure 2. Internationalizing and localizing a program with the gettext toolset
Figure 2 shows the process of internationalizing and localizing a program with the gettext toolset. In fact, glibc provides two different sets of interfaces for message translation, neither of which is defined by the POSIX standard. One is gettext and the other is catgets. Although catgets is defined in the X/Open standard, that standard grew out of industry decisions and is therefore not necessarily well designed. Compared with gettext, the disadvantage of catgets is that the third parameter of the function catgets (the message ID) must be unique, which makes it hard for programmers and translators to manage and maintain messages. We therefore recommend gettext. Listing 9 shows the functions related to catgets.
Listing 9. functions related to catgets
    #include <nl_types.h>
    nl_catd catopen(const char *name, int flag);
    int catclose(nl_catd catalog);
    char *catgets(nl_catd catalog, int set_number, int message_number,
                  const char *message);
Related functions
setlocale() is a key function in Linux internationalization and localization. Traces of it can be found in the source code of system tools (such as the date command, see Listing 6; the locale and localedef commands call it too) and of applications such as the text editor gedit.
    #include <locale.h>
    char *setlocale(int category, const char *locale);
The setlocale function sets or queries the program's current locale. The category parameter has 13 possible values (such as LC_COLLATE), defined in the header file locale.h as the integers 0 to 12. The second parameter, locale, should be a locale name such as zh_CN.UTF-8, but two special values are also accepted: "" and NULL. If locale is "", the function sets the locale from the system's environment variables (that is, the program's locale follows the system's). If locale is NULL, the function only returns the program's current locale setting without changing it. Before setlocale is called, the program's locale is "C" (or "POSIX") by default. An internationalized program should therefore always call setlocale, and to keep the program general the locale is usually given as "". The other two functions in Listing 6 are declared in the header file libintl.h.
    #include <libintl.h>
    char *textdomain(const char *domainname);
The function textdomain sets the current default domain used by functions such as gettext(). Its parameter domainname is the new domain. If domainname is "", the domain is reset to the default "messages", but hardly anyone uses this value because it causes conflicts and confusion between programs. If domainname is NULL, the function returns the program's current domain (the predefined "messages" if none has been set). Note that functions that take their own domainname parameter, such as dcgettext(), are not affected by textdomain (unless their domainname is NULL).
    #include <libintl.h>
    char *bindtextdomain(const char *domainname, const char *dirname);
    char *bind_textdomain_codeset(const char *domainname, const char *codeset);
The function bindtextdomain sets the directory in which the message catalogs of a given domain are searched. The full path of a catalog has the form:
    dirname/locale/category/domainname.mo
Here dirname is the dirname parameter. If dirname is NULL, the function returns the directory currently bound to the domain (/usr/share/locale by default); if domainname is NULL or "", the return value is NULL. locale is the locale name and category is the locale category, such as LC_MESSAGES; domainname is the domainname parameter. The function bind_textdomain_codeset works much like bindtextdomain, so glibc implements both with one internal function, set_binding_values, and selects the behavior through the arguments passed to it (see Listing 10).
Listing 10. Implementation of bindtextdomain in glibc
    /* intl/bindtextdom.c */
    static void
    set_binding_values (domainname, dirnamep, codesetp)
         const char *domainname;
         const char **dirnamep;
         const char **codesetp;
    {
      ...
    }

    /* function bindtextdomain */
    char *
    bindtextdomain (domainname, dirname)
         const char *domainname;
         const char *dirname;
    {
      set_binding_values (domainname, &dirname, NULL);
      return (char *) dirname;
    }

    /* function bind_textdomain_codeset */
    char *
    bind_textdomain_codeset (domainname, codeset)
         const char *domainname;
         const char *codeset;
    {
      set_binding_values (domainname, NULL, &codeset);
      return (char *) codeset;
    }
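The catalog path layout and the query behavior of bindtextdomain described above can be sketched as follows. The helpers catalog_path and binding_round_trips are hypothetical names introduced for illustration, and /tmp/demo-locale in the usage note is an arbitrary example directory that does not need to exist.

```c
#include <libintl.h>
#include <stdio.h>
#include <string.h>

/* Builds dirname/locale/category/domainname.mo into buf.
   Returns 0 on success, -1 if buf is too small. */
int catalog_path(char *buf, size_t size, const char *dirname,
                 const char *locale, const char *category,
                 const char *domainname)
{
    int n = snprintf(buf, size, "%s/%s/%s/%s.mo",
                     dirname, locale, category, domainname);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Binds a domain to a directory, then queries the binding by passing
   NULL as dirname. Returns 1 if the directory read back matches. */
int binding_round_trips(const char *domainname, const char *dirname)
{
    const char *current;
    if (bindtextdomain(domainname, dirname) == NULL)
        return 0;
    current = bindtextdomain(domainname, NULL);
    return current != NULL && strcmp(current, dirname) == 0;
}
```

For example, catalog_path(buf, sizeof buf, "po", "zh_CN.UTF-8", "LC_MESSAGES", "gettext-hello") produces po/zh_CN.UTF-8/LC_MESSAGES/gettext-hello.mo, and binding_round_trips("demo-domain", "/tmp/demo-locale") checks that a binding can be read back.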
Another important function, nl_langinfo, gives access to locale-related information. It is declared in the header file langinfo.h and returns the locale information that corresponds to a given item.
Listing 11. Functions: nl_langinfo
    #include <langinfo.h>
    char *nl_langinfo(nl_item item);

    /* locale/nl_langinfo.c */
    char *
    nl_langinfo (item)
         nl_item item;
    {
      return __nl_langinfo_l (item, _NL_CURRENT_LOCALE);
    }
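As a usage illustration for nl_langinfo, the following sketch selects a locale and prints a few of its items. The function name print_locale_info is hypothetical; CODESET, RADIXCHAR, D_T_FMT, and DAY_1 are standard item constants from langinfo.h. The printed values depend on the locales installed on the system, so no output is shown here.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Switches to the named locale and prints a few nl_langinfo items.
   Returns 0 on success, -1 if the locale is not available. */
int print_locale_info(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    printf("codeset   = %s\n", nl_langinfo(CODESET));   /* character encoding */
    printf("radixchar = %s\n", nl_langinfo(RADIXCHAR)); /* decimal separator */
    printf("d_t_fmt   = %s\n", nl_langinfo(D_T_FMT));   /* date and time format */
    printf("day_1     = %s\n", nl_langinfo(DAY_1));     /* name of Sunday */
    return 0;
}
```

Passing "" as locale_name makes the function follow the environment, so running the program with different LC_ALL values shows how the items change.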
Simple Example
The description above gives a general picture of Linux's support for internationalization and localization. Next we write a small program that gets the current system time, to better understand how the setlocale function affects a whole program (see Listing 12).
Listing 12. A program that uses setlocale to obtain the current system time
    #include <time.h>
    #include <locale.h>
    #include <stdio.h>

    #define SIZE 80

    int main (int argc, char *argv[])
    {
      time_t now;
      struct tm *timeinfo;
      struct lconv *lc;
      char buffer[SIZE];

      /* Follow the locale environment variables. */
      setlocale (LC_ALL, "");
      /* Passing NULL only queries the current setting. */
      printf ("LC_TIME = %s\n", setlocale (LC_TIME, NULL));
      printf ("LC_MONETARY = %s\n", setlocale (LC_MONETARY, NULL));

      time (&now);
      timeinfo = localtime (&now);
      strftime (buffer, SIZE, "%c", timeinfo);
      printf ("Date: %s\n", buffer);

      lc = localeconv ();
      printf ("Currency symbol: %s\n", lc->currency_symbol);
      return 0;
    }

    $ gcc -Wall locale-time.c -o locale-time
    $ LC_ALL=zh_CN.UTF-8 ./locale-time
    LC_TIME = zh_CN.UTF-8
    LC_MONETARY = zh_CN.UTF-8
    Date: 2009年11月05日 星期四 19时47分46秒
    Currency symbol: ¥
Listing 12 not only shows the use of the special setlocale argument values "" and NULL, but also uses the lconv structure to print the currency symbol for the region (the lconv structure and the related function localeconv, which expose numeric and monetary formatting rules, are defined in the header file locale.h and correspond to LC_NUMERIC and LC_MONETARY in the locale). When running the program we changed the value of a locale environment variable on the command line to better observe the effect of setlocale on the program (just as with the date command described above, see Listing 5). A program that does not call setlocale uses the default environment value "C" or "POSIX". As mentioned above, glibc provides two different sets of interfaces for program internationalization and localization. To better understand both, we present them separately below (see Listings 13 and 15).
Listing 13. Example of gettext
    #include <locale.h>
    #include <libintl.h>
    #include <stdio.h>

    #define PACKAGE "gettext-hello"
    #define LOCALEDIR "po"
    #define N_(msgid) gettext (msgid)

    int main (int argc, char *argv[])
    {
      setlocale (LC_CTYPE, "zh_CN.UTF-8");
      setlocale (LC_MESSAGES, "zh_CN.UTF-8");
      /* Look for catalogs under ./po and use PACKAGE as the domain. */
      bindtextdomain (PACKAGE, LOCALEDIR);
      textdomain (PACKAGE);

      /* Translators: here is only a comment */
      printf (N_("Are you OK?\n"));
      return 0;
    }
We hard-code the locale name "zh_CN.UTF-8" here to make it easier to create the test directory, but more often "" is passed so that the program adapts to different language (regional) environments. We also need to translate this simple program and generate a usable binary translation file. We use the tools xgettext and msgfmt provided by GNU gettext to generate the translation files; they are only two of the tools for creating and maintaining PO (Portable Object) and MO (Machine Object) files (see Listing 14).
Listing 14. Execute the gettext sample program
    $ xgettext --add-comments --keyword=N_ gettext-hello.c -o \
    > gettext-hello.pot --from-code=UTF-8
    $ cp gettext-hello.pot gettext-hello.po
    $ cat gettext-hello.po
    ...
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    ...
    #. Translators: here is only a comment
    #: gettext-hello.c:23
    msgid "Are you OK?\n"
    msgstr "Are you okay?\n"
    $ mkdir -p po/zh_CN.UTF-8/LC_MESSAGES/
    $ msgfmt gettext-hello.po -o gettext-hello.mo
    $ mv gettext-hello.mo po/zh_CN.UTF-8/LC_MESSAGES/
    $ gcc -Wall gettext-hello.c -o gettext-hello
    $ LC_ALL=zh_CN.UTF-8 ./gettext-hello
    Are you okay?
Listing 15. catgets example
    #include <nl_types.h>
    #include <locale.h>
    #include <stdio.h>

    #define CATALOG_NAME "catgets-hello.cat"

    int main (int argc, char *argv[])
    {
      nl_catd catd;

      setlocale (LC_ALL, "");
      printf ("LC_MESSAGES = %s\n", setlocale (LC_MESSAGES, NULL));

      catd = catopen (CATALOG_NAME, NL_CAT_LOCALE);
      if (catd == (nl_catd) -1) {
        perror ("catopen");
        return 1;
      }

      int set_no = 11;
      int msg_id = 14;
      printf ("%s\n", catgets (catd, set_no, msg_id, "Are you OK?"));

      if (catclose (catd) < 0) {
        perror ("catclose");
        return 1;
      }
      return 0;
    }
We added error handling to the catgets sample code, but only for demonstration. Generally it is not necessary, because the program should keep running rather than abort: when a message cannot be found, catgets simply returns the default string passed as its last argument. After editing the message file the program needs, we run this simple catgets example (see Listing 16).
Listing 16. Run the catgets sample program
    $ cat catgets-hello.msg
    ...
    $set 11
    14 Are you okay?
    15 I am fine, thanks.
    ...
    $ gencat catgets-hello.msg -o catgets-hello.cat
    $ mv catgets-hello.cat po/zh_CN.UTF-8/LC_MESSAGES/
    $ gcc -Wall catgets-hello.c -o catgets-hello
    $ export NLSPATH=po/%L/LC_MESSAGES/%N
    $ LC_ALL=zh_CN.UTF-8 ./catgets-hello
    LC_MESSAGES = zh_CN.UTF-8
    Are you okay?
The gencat command generates the binary message catalog the program needs, and NLSPATH specifies where the catalog is stored. If this variable is not set, the default locations are as follows.
    [1] prefix/share/locale/%L/%N
    [2] prefix/share/locale/%L/LC_MESSAGES/%N
Here %L stands for the locale name and %N for the catalog file name. A good way to explore the examples and commands above is to trace them with a debugging tool such as strace, which both familiarizes you with the behavior of the related functions and gives a better understanding of the Linux internationalization and localization mechanisms.
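The substitution of %L and %N in an NLSPATH template can be sketched as follows. This is a simplified illustration, not the glibc implementation: the function name expand_nlspath is hypothetical, only %L, %N, and %% are handled, and other escapes are copied through unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Expands %L (locale name), %N (catalog name) and %% in tmpl into out.
   Returns 0 on success, -1 if out is too small. */
int expand_nlspath(const char *tmpl, const char *locale, const char *name,
                   char *out, size_t size)
{
    size_t used = 0;

    if (size == 0)
        return -1;

    for (const char *p = tmpl; *p != '\0'; p++) {
        const char *piece = p;   /* by default copy the character itself */
        size_t len = 1;

        if (p[0] == '%' && p[1] == 'L') {
            piece = locale;
            len = strlen(locale);
            p++;                 /* skip the 'L' */
        } else if (p[0] == '%' && p[1] == 'N') {
            piece = name;
            len = strlen(name);
            p++;                 /* skip the 'N' */
        } else if (p[0] == '%' && p[1] == '%') {
            p++;                 /* "%%" yields a single '%' */
            piece = p;
        }

        if (used + len >= size)  /* keep room for the terminating NUL */
            return -1;
        memcpy(out + used, piece, len);
        used += len;
    }
    out[used] = '\0';
    return 0;
}
```

With the NLSPATH value from Listing 16, the locale zh_CN.UTF-8, and the catalog name catgets-hello.cat, the result is po/zh_CN.UTF-8/LC_MESSAGES/catgets-hello.cat, the location the catalog was moved to.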
Conclusion
From the concepts of internationalization and localization to the glibc source code behind the Linux mechanism, we have tried to understand how Linux internationalization and localization work from several perspectives, and we have written some examples along the way. Some topics are still left out, such as multilingual environments and the internationalization of the X Window System. Another project worth noting is uClibc, a C library designed for embedded Linux. It is much smaller than glibc, but its functions and implementation differ slightly.