From calling printf() to seeing the characters on screen

Last Update:2014-12-27 Source: Internet

Author: User

0 Introduction

Consider the following minimal C program:

#include <stdio.h>

int main(int argc, char **argv)
{
    printf("ABC");
    return 0;
}

This article traces the execution of this program: what exactly happens between the call to printf() and the three characters "ABC" appearing on the display.

1 First stage: printf() ends in a write() call to the terminal

Running the program under strace shows that it ends up calling write(1, "ABC", 3). The net effect is that three bytes are written to the terminal device.

Now let's replace the simple ABC with Chinese text.

#include <stdio.h>

int main(int argc, char **argv)
{
    printf("中文");
    return 0;
}

Tracing this with strace shows that the final call is write(1, "\344\270\255\346\226\207", 6). In hexadecimal, the six bytes written to the terminal device are E4 B8 AD E6 96 87, which is the UTF-8 encoding of the two characters "中文". The output is UTF-8 because the C source file itself was saved as UTF-8. If we save the same source in GBK instead, compile it, run it, and trace it again, the final call becomes write(1, "\326\320\316\304", 4). In hexadecimal that is D6 D0 CE C4, four bytes in total, which is the GBK encoding of the same two characters.

Thus printf("string") ultimately writes the encoded bytes of the string to the terminal device, and which encoding that is depends on how the source file was saved. In effect, the encoding is fixed at compile time. printf() just treats its argument as a byte array; it neither knows about nor needs the concept of a character encoding.

This obviously causes problems. An executable compiled from a UTF-8 source file can only output UTF-8 strings; run on a non-UTF-8 terminal, its output will not display correctly (unless it is transcoded, for example with the iconv tool). Fixing the character encoding at compile time is clearly inflexible. To improve on this, printf() provides the %ls conversion, whose argument is a string of type wchar_t, also known as a wide-character string. With gcc on Linux, each wchar_t occupies 4 bytes in memory and holds the character's Unicode code point (note: not its UTF-8 encoding). This representation is fixed and does not change with the encoding in which the source file is saved. To distinguish a wide string from a traditional char string, write the literal with the L prefix, as in L"string". The following is the wide-character version of the program.

#include <stdio.h>
#include <wchar.h>

int main(int argc, char **argv)
{
    printf("%ls", L"中文");
    return 0;
}

With %ls, printf() knows that the pointer argument refers to a wide string rather than a multibyte string (a plain byte array), so it converts the wide string to a suitable character encoding before finally calling write(). Running the program under strace, however, shows that it ends up calling write(1, "-N", 2), which is clearly wrong. Something went wrong in the encoding step before write() was called. Let's first look at the wide string "中文" in memory. To make debugging with GDB easier, modify the source slightly:

#include <stdio.h>
#include <wchar.h>

int main(int argc, char **argv)
{
    wchar_t str[] = L"中文";
    printf("%ls", str);
    return 0;
}

Debug with GDB to inspect the memory occupied by str:

(gdb) p str
$2 = L"中文"
(gdb) x/12xb &str
0xbffff684:     0x2d    0x4e    0x00    0x00    0x87    0x65    0x00    0x00
0xbffff68c:     0x00    0x00    0x00    0x00

As the dump shows, each character occupies 4 bytes and holds a Unicode code point (U+4E2D and U+6587, stored low byte first), and this memory layout is of course fixed at compile time. So why did the output come out as "-N"? Because printf("%ls") encodes a wide string according to the locale in effect at run time. We did not set a locale explicitly, so the default C locale is used, which means the wide characters are to be converted to ASCII. That conversion is not actually possible for these characters, so in effect nothing was converted: the third byte, 0x00, was taken as the end of the string, and the two bytes before it, 2D 4E, are the ASCII codes of the characters "-" and "N". (This is what the trace above showed; a C library may instead report a conversion error at this point.)

The conversion under the C locale did not succeed, but at least it was attempted at run time. Next, we change the program slightly so that it sets the locale.

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(int argc, char **argv)
{
    setlocale(LC_ALL, argv[1]);   /* locale name from the command line */
    wchar_t str[] = L"中文";
    printf("%ls", str);
    return 0;
}

This lets us supply the locale to the program at run time. Tracing again with strace ./a.out zh_CN.UTF-8 shows that the final call is write(1, "\344\270\255\346\226\207", 6). This time the wide string is encoded as UTF-8 before write() sends it to the terminal, and because the current terminal also uses UTF-8, "中文" is displayed correctly. What if the terminal uses GBK? No problem: just pass zh_CN.GBK as the argument when running the program, as follows:

./a.out zh_CN.GBK

Tracing again shows that the final call is write(1, "\326\320\316\304", 4): the bytes written are the GBK encoding of "中文".

To summarize:

(1) printf("%s") or printf("...") passes the string to write() as a plain byte array; no character encoding is involved.

(2) printf("%ls") encodes the wide string according to the run-time locale and then calls write().

(3) The character encoding rules are set with setlocale().

Character encoding comes into play in four places: (1) the source file, (2) the in-memory representation of the string, (3) the conversion inside printf(), and (4) the terminal itself.

As long as the conversion in (3) is correct and (3) and (4) use the same character encoding, the application has done its part toward displaying the string correctly. Whether the string actually appears correctly on the display also depends on support from the operating system and the physical device, which is the second stage.

2 Second stage: the terminal drives the display device

The system call write(1, byte array, length) eventually reaches the _write() function of the terminal device, which in turn calls the driver of the underlying hardware (such as a video card) to control the display. Here we only consider the simple string "ABC", because it displays correctly on every type of terminal device.

Terminals can be divided into three categories according to the underlying output device: first, a VGA display in text mode; second, a VGA display in graphics mode; third, a pseudo-terminal, whose underlying output device is a window of some GUI system. The display process for each is, roughly, as follows. In VGA text mode, the character generator in the video card firmware does not support Chinese. In graphics mode, Chinese could in theory be displayed as long as glyph bitmaps for it are available, but in practice the terminal software itself does not support Unicode and cannot display Chinese; some unofficial software such as zhcon supports Chinese to some extent, but it is far from perfect. For the third category, almost all GUI windows support Chinese output, which makes it the most complete terminal type at present.

3 Differences between platforms

This article discusses printf() on the Linux platform. printf() is a C standard library function and is in theory the same on every platform, but in practice it differs considerably between platforms, especially since the introduction of wchar_t. For example, wchar_t itself is 4 bytes when compiled with gcc but 2 bytes when compiled with VC++, and VC++'s wprintf() interprets a plain %s as a wide string.
Character encoding is one of the most fundamental and important topics in computing. Until Unicode is adopted uniformly everywhere, conversions between encodings will keep making string handling complicated. Garbled text has long plagued programmers; understanding encoding principles and conversion details in depth is the best way to resolve it, and sticking to Unicode is a good habit that avoids many of these troubles.
