Most Linux distributions now support UTF-8 encoding. The language and character encoding settings of the current system are stored in a set of environment variables, which can be viewed with the locale command:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Commonly used Chinese characters all lie in the BMP, so in UTF-8 a Chinese character usually occupies three bytes of storage. For example, write the following C program:
#include <stdio.h>

int main(void)
{
	printf("你好\n");
	return 0;
}
The source file is stored in UTF-8 encoding:
$ od -tc nihao.c
0000000   #   i   n   c   l   u   d   e       <   s   t   d   i   o   .
0000020   h   >  \n  \n   i   n   t       m   a   i   n   (   v   o   i
0000040   d   )  \n   {  \n  \t   p   r   i   n   t   f   (   " 344 275
0000060 240 345 245 275   \   n   "   )   ;  \n  \t   r   e   t   u   r
0000100   n       0   ;  \n   }  \n
0000107
Here 344 275 240 (hexadecimal e4 bd a0) is the UTF-8 encoding of 你 ("you"), and 345 245 275 (hexadecimal e5 a5 bd) is 好 ("good"). After compilation into an object file, the string literal "你好\n" becomes the byte sequence e4 bd a0 e5 a5 bd 0a 00. The Chinese characters are still UTF-8 encoded, three bytes each; in C such a character is called a multibyte character. Running the program amounts to writing this byte sequence to the device file of the current terminal with write. If the terminal driver recognizes UTF-8, the Chinese characters are displayed; if it does not (as with a plain character terminal), they are not.
In other words, in a program like this the work of recognizing Chinese characters is done neither by the C compiler nor by libc. The C compiler copies the UTF-8 bytes from the source file into the object file unchanged, libc treats them as a 0-terminated string and passes them unchanged to the kernel with write, and the recognition of Chinese characters is done by the terminal driver.
Supporting Chinese characters only to this degree is not enough, however. Sometimes a C program needs to operate on the characters in a string, for example to find out how many Chinese characters, or characters of any kind, "你好\n" contains. strlen is no help here, because it only looks for the 0 byte at the end of the string and, whatever the string holds, reports the number of bytes: 7. To operate on Unicode characters in a program, the C language defines the wide character type wchar_t and a set of library functions. Prefixing a character constant or string literal with L makes it a wide character constant or wide string. For example, after wchar_t c = L'你'; the value of the variable c is the 31-bit UCS encoding of the character 你, L"你好\n" is equivalent to {L'你', L'好', L'\n', 0}, and the wcslen function returns the number of characters in a wide string. See the following program:
#include <stdio.h>
#include <locale.h>

int main(void)
{
	/* Pick up the encoding settings from the environment. */
	if (!setlocale(LC_CTYPE, "")) {
		fprintf(stderr, "Can't set the specified locale! "
			"Check LANG, LC_CTYPE, LC_ALL.\n");
		return 1;
	}
	printf("%ls", L"你好\n");
	return 0;
}
The wide string L"你好\n" is of course stored in the source file as UTF-8, but the compiler converts it into four UCS codes, 0x00004f60 0x0000597d 0x0000000a 0x00000000, and stores them in the object file. In little-endian byte order these are the bytes 60 4f 00 00 7d 59 00 00 0a 00 00 00 00 00 00 00, which you should be able to find by viewing the object file with the od command:
$ gcc hihao.c
$ od -tx1 a.out
The %ls conversion specification of printf says that the corresponding argument is to be interpreted as a wide string: it does not end at the first 0 byte, but only at a character whose UCS code is 0. To be written to the terminal, however, the text still has to be output in a multibyte encoding that the terminal driver can recognize, so printf converts the wide string to a multibyte string internally and then passes it to write.
In fact, the C standard does not require multibyte characters to be encoded in UTF-8. Other multibyte encodings may be used, and the encoding of the current system is determined at run time from the environment variables. This is why the program must call setlocale at the start to obtain the encoding settings of the current system; if the current system uses UTF-8, printf converts the UCS codes to a UTF-8 multibyte string before passing it to write.
In general, a program uses wide characters for its internal computation and a multibyte encoding when it saves data to disk, outputs it to another program, or sends it to another program over the network.
From http://learn.akae.cn/media/apas03.html