Using Unicode and UTF-8 in Linux C Programming

Last Update:2018-12-06 Source: Internet

Author: User

Most current Linux distributions support UTF-8 encoding. The system's language and character encoding settings are stored in a set of environment variables, which the locale command displays:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Common Chinese characters all lie in the Unicode Basic Multilingual Plane (BMP), so in UTF-8 a Chinese character usually occupies three bytes. For example, edit a C program:

#include <stdio.h>

int main(void)
{
	printf("你好\n");
	return 0;
}

The source file is stored in UTF-8 encoding:

$ od -tc nihao.c
0000000   #   i   n   c   l   u   d   e       <   s   t   d   i   o   .
0000020   h   >  \n  \n   i   n   t       m   a   i   n   (   v   o   i
0000040   d   )  \n   {  \n  \t   p   r   i   n   t   f   (   " 344 275
0000060 240 345 245 275   \   n   "   )   ;  \n  \t   r   e   t   u   r
0000100   n       0   ;  \n   }  \n
0000107

Here 344 275 240 (hexadecimal e4 bd a0) is the UTF-8 encoding of "你", and 345 245 275 (hexadecimal e5 a5 bd) is that of "好". After compilation into an object file, the string literal "你好\n" becomes the byte sequence e4 bd a0 e5 a5 bd 0a 00. The Chinese characters are still UTF-8 encoded, three bytes each; C calls such a character a multibyte character. Running the program amounts to passing this byte sequence to write on the device file of the current terminal. If the terminal driver understands UTF-8, the Chinese characters are displayed; if it does not (an ordinary character terminal, for example), they are not. In other words, in a program like this neither the C compiler nor libc does any work to recognize Chinese characters. The compiler copies the UTF-8 bytes from the source file into the object file unchanged, libc treats them as a string terminated by a 0 byte and hands it to the kernel through write, and recognizing the Chinese characters is left to the terminal driver.

Supporting Chinese only to this degree is not enough, though. Sometimes a C program needs to operate on the characters within a string, for example to count how many Chinese characters, or characters of any kind, "你好\n" contains. strlen is no help here, because it only looks for the 0 byte that ends the string and does not care what is stored before it: it reports 7 bytes. For working with Unicode characters in a program, C defines the wide character type wchar_t and a set of library functions. Prefixing a character constant or string literal with L makes it a wide character constant or a wide string. For example, after wchar_t c = L'你'; the variable c holds the 31-bit UCS code of the character "你", and L"你好\n" is equivalent to {L'你', L'好', L'\n', 0}. The wcslen function returns the number of characters in a wide string. Consider the following program:

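#include <stdio.h>
#include <locale.h>

int main(void)
{
	/* Take the character encoding from the environment variables. */
	if (!setlocale(LC_CTYPE, "")) {
		fprintf(stderr, "Can't set the specified locale! "
			"Check LANG, LC_CTYPE, LC_ALL.\n");
		return 1;
	}
	/* %ls prints a wide string, converted to the locale's encoding. */
	printf("%ls", L"你好\n");
	return 0;
}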

The wide string L"你好\n" is of course still stored as UTF-8 in the source file, but the compiler converts it into four UCS codes, 0x00004f60 0x0000597d 0x0000000a 0x00000000, and stores them in the object file. In little-endian byte order that is 60 4f 00 00 7d 59 00 00 0a 00 00 00 00 00 00 00, and these bytes should be findable in the compiled file with the od command:

$ gcc nihao.c
$ od -tx1 a.out

The %ls conversion specification of printf says that the corresponding argument is to be interpreted as a wide string. Output does not stop at the first 0 byte but only at a character whose UCS code is 0. To write the text to the terminal, however, it still has to be output in a multibyte encoding so that the terminal driver can recognize it, so printf internally converts the wide string to a multibyte string and then writes it out. In fact, the C standard does not require multibyte characters to be encoded in UTF-8. Other multibyte encodings may be used, and the encoding of the current system is determined at run time from environment variables. That is why the program must call setlocale at the start to pick up the system's encoding settings. If the system uses UTF-8, printf converts the UCS codes to a UTF-8 multibyte string before writing it out. In general, a program uses wide characters for its internal processing and a multibyte encoding when it saves data to disk, passes output to another program, or sends data to another program over the network.

From http://learn.akae.cn/media/apas03.html

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
