Using Unicode and UTF-8 in Linux C Programming

Source: Internet
Author: User

Most current Linux distributions support UTF-8 encoding. The system's language and character encoding settings are stored in a set of environment variables, which can be viewed with the locale command:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Common Chinese characters lie in the Basic Multilingual Plane (BMP), so in UTF-8 a Chinese character usually occupies three bytes. For example, write the following C program and save it as nihao.c:

#include <stdio.h>

int main(void)
{
	printf("你好\n");
	return 0;
}

The source file is stored in UTF-8 encoding:

$ od -tc nihao.c
0000000   #   i   n   c   l   u   d   e       <   s   t   d   i   o   .
0000020   h   >  \n  \n   i   n   t       m   a   i   n   (   v   o   i
0000040   d   )  \n   {  \n  \t   p   r   i   n   t   f   (   " 344 275
0000060 240 345 245 275   \   n   "   )   ;  \n  \t   r   e   t   u   r
0000100   n       0   ;  \n   }  \n
0000107

Here 344 275 240 (hexadecimal e4 bd a0) is the UTF-8 encoding of 你 ("you"), and 345 245 275 (hexadecimal e5 a5 bd) is that of 好 ("good"). When the program is compiled into an object file, the string "你好\n" becomes the byte sequence e4 bd a0 e5 a5 bd 0a 00. The Chinese characters remain UTF-8 encoded, three bytes each; in C such a character is called a multibyte character. Running the program amounts to writing this byte sequence to the device file of the current terminal with write. If the terminal driver recognizes UTF-8, the Chinese characters are displayed; if it does not (as on an ordinary character terminal), they are not. In other words, in a program like this neither the C compiler nor libc does any work to recognize Chinese characters. The compiler copies the UTF-8 bytes from the source file into the object file unchanged, libc treats them as a string terminated by a 0 byte and passes them to the kernel with write, and recognizing the Chinese characters is left to the terminal driver.

Supporting Chinese characters only to this degree is not enough, however. Sometimes a C program needs to operate on the characters in a string, for example to find out how many Chinese characters, or characters of any kind, "你好\n" contains. strlen is no help here, because it only looks for the 0 byte that ends the string; whatever the string holds, it counts bytes, and here it returns 7. For working with Unicode characters in a program, the C language defines the wide character type wchar_t and a set of library functions. Prefixing a character constant or string literal with L makes it a wide character constant or a wide string. For example, after wchar_t c = L'你'; the variable c holds the 31-bit UCS code of the Chinese character 你, and L"你好\n" is equivalent to {L'你', L'好', L'\n', 0}. The wcslen function returns the number of characters in a wide string. Consider the following program:

#include <stdio.h>
#include <locale.h>

int main(void)
{
	if (!setlocale(LC_CTYPE, "")) {
		fprintf(stderr, "Can't set the specified locale! "
			"Check LANG, LC_CTYPE, LC_ALL.\n");
		return 1;
	}
	printf("%ls", L"你好\n");
	return 0;
}

The wide string L"你好\n" is of course stored in the source file as UTF-8, but the compiler converts it into four UCS codes, 0x00004f60 0x0000597d 0x0000000a 0x00000000, and stores them in the object file. In little-endian byte order these are the bytes 60 4f 00 00 7d 59 00 00 0a 00 00 00 00 00 00 00, which should turn up when the compiled file is inspected with the od command:

$ gcc nihao.c
$ od -tx1 a.out

The %ls conversion specification of printf says that the corresponding argument is to be interpreted as a wide string: it ends not at the first 0 byte but at the first character whose UCS code is 0. To be written to the terminal, however, it still has to be output in a multibyte encoding that the terminal driver can recognize, so printf internally converts the wide string to a multibyte string and then writes it out. In fact, the C standard does not require multibyte characters to be encoded in UTF-8. Other multibyte encodings may be used, and the current system's encoding is determined at run time from environment variables. That is why the program must call setlocale at the start to pick up the system's encoding settings. If the system uses UTF-8, printf converts the UCS codes to a UTF-8 encoded multibyte string before writing it out. In general, a program uses wide characters for its internal computation and a multibyte encoding when saving data to disk, passing it to another program, or sending it over the network.

From http://learn.akae.cn/media/apas03.html
