Character Set: How to (in a program) add and use Unicode for foreign language support

Last Update:2018-12-04 Source: Internet

Author: User


Level: Introductory

Thomas W. Burger (twburger@bigfoot.com), Owner, Thomas Wolfgang Burger Consulting

August 01, 2001

Unicode, a system for representing multiple character sets on computers, supports the encoding and exchange of text in all of the world's languages. This article explains why international language support matters in Linux applications and outlines how to plan for Unicode support and integrate it into them.

Unicode is not just a programming tool but also a political and economic one. Applications that do not support the world's languages can be used only by people who can read and write a language that ASCII supports. This shuts most people in the world out of ASCII-based computer technology. Unicode lets programs use any of the world's character sets, and so it can support any language.

Unicode lets programmers give ordinary people software they can use in their own language. Nobody has to learn a foreign language first, and the social and financial benefits of computer technology become easier to reach. Imagine how little computers would be used in the United States if people had to learn Urdu to use an Internet browser: the Web would never have come about.

Linux is largely committed to Unicode. Unicode support is built into the kernel and the development libraries, and for the most part it can be integrated into a program with a few simple calls.

All modern character sets are rooted in the American Standard Code for Information Interchange (ASCII), published as ANSI X3.4 in 1968. A notable exception is IBM's Extended Binary Coded Decimal Interchange Code (EBCDIC), which was defined before ASCII. ASCII is a coded character set (CCS), that is, a mapping from integers to characters. ASCII itself defines 128 characters using seven bits; the extended ASCII sets use all eight bits of a byte (2^8 = 256) to represent 256 characters. This is a highly restricted coded character set. It cannot represent all the characters of many languages (such as Chinese and Japanese), nor scientific symbols, ancient scripts (runes and hieroglyphics), or musical notation. Changing the size of a byte to allow a larger coded character set would work in principle but is completely impractical, because all computers are built on eight-bit bytes. The solution is a character encoding scheme (CES), which represents values beyond the 256 that a single byte can hold by using fixed-length or variable-length multi-byte sequences. These values are then mapped through the coded character set to the characters they represent.

Unicode Definition

Unicode is commonly used as a generic term for a two-byte character encoding scheme. The Unicode 3.1 CCS is officially known as the ISO 10646-1 Universal Multiple-Octet Coded Character Set (UCS). Unicode 3.1 adds 44,946 new encoded characters to the 49,194 already in Unicode 3.0, for a total of 94,140.

The Unicode character set uses a four-dimensional coding space made up of 128 three-dimensional groups. Each group contains 256 two-dimensional planes. Each plane consists of 256 one-dimensional rows, and each row has 256 cells. Each cell either encodes one character or is declared unused. This coding concept is called UCS-4: four octets represent each character, specifying its group, plane, row, and cell.
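The group, plane, row, and cell layout described above can be sketched in a few lines of C. This is only an illustration: the `ucs4_position` structure and the `ucs4_split()` function are hypothetical names introduced here, and the sketch assumes the code is held in a 32-bit unsigned integer whose top bit is zero (128 groups need only seven bits).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the four coordinates of a UCS-4 code. */
struct ucs4_position {
    uint8_t group;  /* bits 24-30, range 0..127 */
    uint8_t plane;  /* bits 16-23 */
    uint8_t row;    /* bits 8-15 */
    uint8_t cell;   /* bits 0-7 */
};

/* Split a 31-bit UCS-4 value into its group, plane, row and cell octets. */
struct ucs4_position ucs4_split(uint32_t code)
{
    struct ucs4_position p;

    assert(code <= 0x7FFFFFFFu);  /* only 128 groups exist, so the top bit is 0 */
    p.group = (uint8_t)((code >> 24) & 0x7F);
    p.plane = (uint8_t)((code >> 16) & 0xFF);
    p.row   = (uint8_t)((code >> 8) & 0xFF);
    p.cell  = (uint8_t)(code & 0xFF);
    return p;
}
```

For example, the code 0x2260 splits into group 0, plane 0, row 0x22, cell 0x60. Any character whose group and plane are both zero can be written with the row and cell octets alone.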

The first plane (plane 00 of group 00) is the Basic Multilingual Plane (BMP). The BMP defines the characters in general use in alphabetic, syllabic, and ideographic scripts, together with various symbols and digits. Subsequent planes are used for additional characters or for coded entities not yet invented. This full range is needed to handle all of the world's languages, in particular some East Asian languages that have nearly 64,000 characters.

The BMP, used as a two-octet coded character set, is identified as the ISO 10646 UCS-2 form. ISO 10646 UCS-2 is commonly referred to as (and is the same as) Unicode. Like every plane in the UCS, the BMP contains 256 rows of 256 cells each. Characters in the BMP are encoded by their row and cell octets alone, which allows the major commercial languages to be written with 16-bit coded characters. UCS-2 requires no code page switching, code extensions, or code states. It is a simple way to build Unicode into software, but it supports only the BMP.

To represent a coded character set (CCS) of more than 2^8 = 256 characters with eight-bit bytes, a character encoding scheme (CES) is required.


Unicode Transformation Formats

On UNIX, the most commonly used character encoding scheme is UTF-8. It allows full support for the whole of Unicode, all planes included, and still reads ASCII correctly. The alternatives to UTF-8 are UCS-4, UTF-16, UTF-7.5, UTF-7, SCSU, HTML, and Java.

Unicode transformation formats (UTFs) are character encoding schemes that support Unicode by mapping its values onto multi-byte sequences. This article examines the most popular of them, the UTF-8 CES.

UTF-8

The UTF-8 transformation format is becoming the dominant way of exchanging international text because it supports all of the world's languages while remaining compatible with ASCII. UTF-8 uses variable-width encoding. The characters 0 to 0x7F (127) encode to themselves as a single byte, and characters with larger values are encoded as 2 to 6 bytes. (Later revisions of the standard, RFC 3629 in particular, restrict UTF-8 to at most four bytes, covering code points up to 0x10FFFF.)

Table 1. UTF-8 Encoding

0x00000000-0x0000007f:		0xxxxxxx
0x00000080-0x000007ff:		110xxxxx 10xxxxxx
0x00000800-0x0000ffff:		1110xxxx 10xxxxxx 10xxxxxx
0x00010000-0x001fffff:		11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000-0x03ffffff:		111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000-0x7fffffff:		1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

A byte of the form 10xxxxxx is a continuation byte; its xxxxxx bit positions are filled with bits of the character's code number in binary. The shortest possible multi-byte sequence that can represent the code must be used.
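The table can be turned into code fairly directly. The following is a minimal sketch, not a production encoder: `utf8_encode()` is a hypothetical name, it follows the six-row table as printed in this article, and it assumes the caller supplies a six-byte buffer. It does not reject surrogates or values above 0x10FFFF, which a modern encoder would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code value (0..0x7FFFFFFF) following the six-row table.
 * Writes up to 6 bytes to out and returns the number written,
 * or 0 if the value is out of range. */
size_t utf8_encode(uint32_t code, unsigned char out[6])
{
    /* Upper bound of each table row and the lead-byte marker for that row. */
    static const uint32_t limit[6] = {
        0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF
    };
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    size_t len, i;

    assert(out != NULL);
    for (len = 1; len <= 6; len++)
        if (code <= limit[len - 1])
            break;                      /* shortest sequence that fits */
    if (len > 6)
        return 0;

    /* Fill continuation bytes from the end, six payload bits at a time. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3F));
        code >>= 6;
    }
    /* Whatever bits remain go into the lead byte after its marker. */
    out[0] = (unsigned char)(lead[len - 1] | code);
    return len;
}
```

Because the loop stops at the first row whose upper bound is large enough, the function always produces the shortest sequence, as the text requires.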

Example of UTF-8 Encoding

The Unicode copyright sign, character 0xA9 = 1010 1001, is encoded in UTF-8 as:

11000010 10101001 = 0xC2 0xA9

The "not equal to" sign, character 0x2260 = 0010 0010 0110 0000, is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0

Stripping the marker bits from the lead byte and the continuation bytes reveals the original data:

[1110]0010 [10]001001 [10]100000
0010 001001 100000
0010 0010 0110 0000 = 0x2260

The first byte determines the number of octets that follow it. If it is 0x7F or less, it is simply the equivalent ASCII value. Every other octet in a sequence has its high bit set (continuation octets take the form 10xxxxxx), which ensures that no byte of a multi-byte sequence can be confused with an ASCII value.
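The reverse direction, stripping the marker bits as in the worked example, might look like the sketch below. The names `utf8_sequence_length()` and `utf8_decode()` are hypothetical. The decoder checks that continuation bytes have the 10xxxxxx form but, to stay short, does not reject overlong (non-shortest) sequences, which a strict decoder should.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the sequence announced by a lead byte, or 0 if the byte
 * cannot start a sequence (a continuation byte, 0xFE or 0xFF). */
size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;              /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;    /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;    /* 11110xxx */
    if ((lead & 0xFC) == 0xF8) return 5;    /* 111110xx */
    if ((lead & 0xFE) == 0xFC) return 6;    /* 1111110x */
    return 0;
}

/* Decode one sequence from s (n bytes available) into *code.
 * Returns the number of bytes consumed, or 0 if the input is malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *code)
{
    /* Payload bits of the lead byte, indexed by sequence length. */
    static const unsigned char payload_mask[7] = {
        0, 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01
    };
    size_t len, i;
    uint32_t value;

    assert(s != NULL && code != NULL);
    if (n == 0)
        return 0;
    len = utf8_sequence_length(s[0]);
    if (len == 0 || len > n)
        return 0;

    value = s[0] & payload_mask[len];
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)          /* must be 10xxxxxx */
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    *code = value;
    return len;
}
```

Fed the bytes 0xE2 0x89 0xA0 from the example, the decoder consumes three bytes and yields 0x2260.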


UTF support

Before using UTF-8 on Linux, make sure the distribution includes glibc 2.2 and XFree86 4.0 or newer. Earlier versions lack UTF-8 locale support and ISO10646-1 X11 fonts.

Before UTF-8, Linux users relied on language-specific extensions of ASCII: ISO 8859-1 or ISO 8859-2 in Europe, ISO 8859-7 in Greece, and KOI-8, ISO 8859-5, or CP1251 (Cyrillic) in Russia. This made data exchange problematic and required application software to be written with the differences between these encodings in mind. Support for them was usually incomplete and untested for data exchange. The major Linux distributors and application developers are working to make Unicode, mostly in its UTF-8 form, the standard on Linux.

To identify Unicode files, Microsoft suggests that all Unicode files start with the character ZERO WIDTH NO-BREAK SPACE (U+FEFF). It acts as a signature or byte-order mark (BOM) that identifies the encoding and byte order used in the file. Linux and UNIX, however, do not use BOMs, because they would break the syntax conventions of existing ASCII files. On POSIX systems, the selected locale identifies the encoding expected in all input and output files of a process.
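Although Linux tools do not write a BOM, a program that reads files produced elsewhere may still meet one. As a hedged illustration only, the sketch below detects the UTF-8 form of U+FEFF (the three bytes 0xEF 0xBB 0xBF) at the start of a buffer; `utf8_skip_bom()` is a hypothetical helper, and whether to tolerate a BOM at all is a design decision for the application.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the offset of the first byte after an optional UTF-8 BOM
 * (EF BB BF, the UTF-8 encoding of U+FEFF), or 0 if there is none. */
size_t utf8_skip_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};

    assert(buf != NULL || len == 0);
    if (len >= 3 && memcmp(buf, bom, 3) == 0)
        return 3;
    return 0;
}
```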

There are two ways to add UTF-8 support to a Linux application. In the first (the passive approach), data is kept in UTF-8 form everywhere, so very few software changes are needed. In the second (the converting approach), UTF-8 data that has been read is converted into wide-character arrays using standard C library functions. On output, the strings are converted back to UTF-8 with the wcsrtombs() function:

Listing 1. wcsrtombs()

#include <wchar.h>

size_t wcsrtombs(char *dest, const wchar_t **src, size_t len, mbstate_t *ps);
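A rough sketch of the converting approach follows: a multi-byte string is turned into wide characters with `mbstowcs()` and written back with `wcsrtombs()`. The function name `mb_roundtrip()` and the fixed 64-element buffer are assumptions made for this illustration. Both library calls follow the LC_CTYPE category of the current locale, so the program is assumed to have called `setlocale(LC_CTYPE, "")` under a UTF-8 locale before passing in UTF-8 text; in the default "C" locale only ASCII input is portable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert the multi-byte string `in` to wide characters and back into `out`.
 * Returns the number of wide characters, or (size_t)-1 on any failure. */
size_t mb_roundtrip(const char *in, char *out, size_t outsize)
{
    wchar_t wide[64];               /* illustrative fixed-size buffer */
    const wchar_t *src = wide;
    mbstate_t state;
    size_t nchars, nbytes;

    assert(in != NULL && out != NULL && outsize > 0);

    /* Multi-byte to wide; a result of 64 would mean no room for the terminator. */
    nchars = mbstowcs(wide, in, 64);
    if (nchars == (size_t)-1 || nchars >= 64)
        return (size_t)-1;

    /* Wide back to multi-byte, starting from the initial shift state. */
    memset(&state, 0, sizeof state);
    nbytes = wcsrtombs(out, &src, outsize, &state);

    /* src is set to NULL only when the terminating null was converted too. */
    if (nbytes == (size_t)-1 || src != NULL)
        return (size_t)-1;
    return nchars;
}
```

Between the two conversions the application can work on the `wchar_t` array, where each element holds one whole character.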

The choice of method depends on the nature of the application. Most applications can use the passive approach, which is why UTF-8 is so popular on UNIX. Programs such as cat and echo need no modification: a byte stream is just a byte stream, and they do no processing on it. ASCII characters and control codes are unchanged under UTF-8.

Programs that count characters by counting bytes need small changes. In UTF-8, the application must not count continuation bytes. When a UTF-8 locale is selected, the strlen(s) function has to be replaced with the mbstowcs() function (on glibc and other POSIX systems, mbstowcs(NULL, s, 0) returns the number of characters in s):

Listing 2. mbstowcs() function

#include <stdlib.h>

size_t mbstowcs(wchar_t *pwcs, const char *s, size_t n);
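The rule of not counting continuation bytes can also be applied by hand, without going through the locale machinery. The sketch below is an assumption-laden alternative to `mbstowcs()`, not a replacement for it: `utf8_strlen()` is a hypothetical name, and the function assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a NUL-terminated UTF-8 string by skipping
 * continuation bytes (10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    return count;
}
```

For the three-byte sequence 0xE2 0x89 0xA0, strlen() reports 3 while this function reports 1.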

A common use of strlen is to estimate display width. Chinese characters and other ideographs occupy two column positions. The wcwidth() function is used to test the display width of each character:

Listing 3. wcwidth() function

#include <wchar.h>

int wcwidth(wchar_t wc);


Unicode C Language Support

As of GNU glibc 2.2, the wchar_t type is intended to hold only 32-bit ISO 10646 values, independent of the current locale. This is signalled to applications by the __STDC_ISO_10646__ macro defined by ISO C99. If __STDC_ISO_10646__ is defined, wchar_t is Unicode. Its exact value is a decimal constant of the form yyyymmL. For example:

Listing 4. wchar_t indicates Unicode

#define __STDC_ISO_10646__ 200104L

It indicates that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month.

The use of wchar_t is shown in the following example, in which a macro determines how to write a double quotation mark in portable ISO C99 code.

Listing 5. How to write double quotation marks

#if __STDC_ISO_10646__
    /* wchar_t is Unicode: print U+201C LEFT DOUBLE QUOTATION MARK */
    printf("%lc", (wint_t)0x201c);
#else
    /* fall back to the ASCII quotation mark */
    putchar('"');
#endif

Locales

The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that holds the culture-specific conventions of software behavior, including the character encoding, date and time notation, sorting rules, and measurement system. A locale name usually consists of an ISO 639-1 language code, an ISO 3166-1 country or region code, and an optional encoding name and other qualifiers. The command locale -a lists all locales installed on the system (usually in /usr/lib/locale/).
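The naming convention just described can be illustrated by taking a name such as de_DE.UTF-8 apart. The sketch below is hypothetical throughout: the `locale_parts` structure, its buffer sizes, and `parse_locale_name()` are invented for this example, and the C library does not require applications to parse locale names at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the pieces of a locale name. */
struct locale_parts {
    char language[16];   /* ISO 639-1 code, e.g. "de" */
    char territory[16];  /* ISO 3166-1 code, e.g. "DE" */
    char codeset[32];    /* encoding name, e.g. "UTF-8" */
};

/* Copy at most dstsize-1 bytes of src[0..len) into dst and terminate it. */
static void copy_part(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split "language[_territory][.codeset][@modifier]".
 * Missing parts become empty strings; the modifier is ignored. */
void parse_locale_name(const char *name, struct locale_parts *out)
{
    size_t lang_len, terr_len = 0, code_len = 0;
    const char *p;

    assert(name != NULL && out != NULL);

    lang_len = strcspn(name, "_.@");
    copy_part(out->language, sizeof out->language, name, lang_len);
    p = name + lang_len;

    if (*p == '_') {                 /* optional territory */
        p++;
        terr_len = strcspn(p, ".@");
    }
    copy_part(out->territory, sizeof out->territory, p, terr_len);
    p += terr_len;

    if (*p == '.') {                 /* optional codeset */
        p++;
        code_len = strcspn(p, "@");
    }
    copy_part(out->codeset, sizeof out->codeset, p, code_len);
}
```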

If no UTF-8 locale is preinstalled, you can generate one with the localedef command. To generate and activate a German UTF-8 locale for a specific user, use the following statements:

Listing 6. Generating a locale for a specific user

localedef -v -c -i de_DE -f UTF-8 $HOME/local/locale/de_DE.UTF-8
export LOCPATH=$HOME/local/locale
export LANG=de_DE.UTF-8

Sometimes it is useful to add a UTF-8 locale for all users. The root user can do so with the following command:

Listing 7. Generating a locale for every user

localedef -v -c -i de_DE -f UTF-8 /usr/share/locale/de_DE.UTF-8

To make this locale the default for every user, add the following line to the /etc/profile file:

Listing 8. Setting the default locale for all users

export LANG=de_DE.UTF-8

The behavior of the functions that handle multi-byte character sequences depends on the LC_CTYPE category of the current locale, which determines the locale-dependent multi-byte encoding. The value LANG=de_DE (German) causes output to be formatted in ISO 8859-1. The value LANG=de_DE.UTF-8 causes output to be formatted in UTF-8. The locale setting makes the %ls format specifier in printf call the wcsrtombs() function to convert its wide-character string argument into the locale-dependent multi-byte encoding. Locales that share a language and differ only in the country or region identifier, such as en_GB (British English) and en_AU (Australian English), differ only in the LC_MONETARY category, because the name of the currency and the rules for printing monetary amounts are different.

Set the environment variable LANG to the name of your preferred locale. It takes effect when a C program executes the setlocale() function:

Listing 9. setlocale() function

#include <stdio.h>
#include <locale.h>

/* char *setlocale(int category, const char *locale); */
int main(void)
{
  /* "" selects the locale named by the environment variables */
  if (!setlocale(LC_CTYPE, ""))
  {
    fprintf(stderr, "Locale not specified. Check LANG, LC_CTYPE, LC_ALL.\n");
    return 1;
  }
  /* ... the rest of the program goes here ... */
  return 0;
}

The C library tests the environment variables LC_ALL, LC_CTYPE, and LANG, in that order. The first of them that has a value determines which locale data is loaded for the LC_CTYPE category. Locale data is split into separate categories: LC_CTYPE defines the character encoding, while LC_COLLATE defines the sort order. The LANG environment variable sets the default locale for all categories, but the LC_* variables can be used to override individual categories.
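The lookup order can be mimicked in a few lines. This sketch only illustrates the precedence rule and is not how the C library is implemented: `effective_ctype_locale()` is a hypothetical function, it treats an empty variable the same as an unset one, and the fallback to "C" when nothing is set is an assumption about typical systems.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the locale name for the LC_CTYPE category: the first of LC_ALL,
 * LC_CTYPE and LANG that is set and non-empty. Falls back to "C"
 * (an assumption about the usual default) when none is set. */
const char *effective_ctype_locale(const char *lc_all,
                                   const char *lc_ctype,
                                   const char *lang)
{
    const char *candidates[3];
    size_t i;

    candidates[0] = lc_all;
    candidates[1] = lc_ctype;
    candidates[2] = lang;

    for (i = 0; i < 3; i++)
        if (candidates[i] != NULL && candidates[i][0] != '\0')
            return candidates[i];
    return "C";
}
```

A caller would typically pass in the results of getenv("LC_ALL"), getenv("LC_CTYPE"), and getenv("LANG"), which are NULL for variables that are not set.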

The command locale charmap prints the name of the character encoding of the current locale. It prints UTF-8 if you have successfully selected a UTF-8 locale for the LC_CTYPE category. The command locale -m lists the names of all installed character encodings.

If you use only the C library's multi-byte functions for all conversion between the external character encoding and the wchar_t encoding used internally, the C library takes care of using the right encoding according to LC_CTYPE. The program does not even need to be written with the current multi-byte encoding in mind.

If an application has to support UTF-8 (or another encoding) explicitly, without the libc multi-byte functions, it must determine whether UTF-8 mode needs to be activated. X/Open-compliant systems that provide the <langinfo.h> library header can use the following code:

Listing 10. Checking whether the current locale uses UTF-8 encoding

/* BOOL, TRUE and FALSE are assumed to be defined by the application */
BOOL utf8_mode = FALSE;

if (!strcmp(nl_langinfo(CODESET), "UTF-8"))
  utf8_mode = TRUE;

Before checking whether the current locale uses the UTF-8 encoding, you must call setlocale(LC_CTYPE, ""), which sets the locale according to the environment variables. The nl_langinfo(CODESET) function is also what the locale charmap command calls to find the name of the encoding specified by the current locale.

Another method that can be used is to query the language environment variables:

Listing 11. Querying the locale environment variables

char *s;
BOOL utf8_mode = FALSE;

if ((s = getenv("LC_ALL")) || (s = getenv("LC_CTYPE")) || (s = getenv("LANG"))) {
  if (strstr(s, "UTF-8"))
    utf8_mode = TRUE;
}

This test assumes that the name of every UTF-8 locale contains the string "UTF-8". That is not always the case (some systems spell it "utf8", for instance), so the nl_langinfo() method should be used instead.


Summary

Supporting all of the world's languages with eight-bit bytes calls for a coded character set together with a character encoding scheme. The character set must hold more characters than ASCII and its extended versions (which use the full unsigned byte), that is, more than 2^8 = 256 characters. Unicode is such a character set. It has a four-dimensional coding space of 128 three-dimensional groups, with 94,140 character values defined so far, and it is supported by a number of character encoding schemes. The most popular of these on Linux is the Unicode transformation format UTF-8.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
