wprintf and wide character display in C

Source: Internet
Author: User

Reposted from: http://blog.csdn.net/lovekatherine/archive/2007/11/06/1868724.aspx [Many thanks to the original author! If you reprint this article, please cite the address above rather than my blog address. Respect other people's work ^_^]

Today I saw an article on the CSDN blog homepage, "Also talking about computer character encoding". Having translated "UTF-8 and Unicode FAQ for Unix/Linux" a while ago, I have kept a strong interest in character sets, encodings, Unicode, and related topics, so naturally I would not miss such an article.

The article is clear and easy to follow, although I have doubts about some of its conceptual details. The author also gives a simple example that uses wprintf on Windows to correctly output the string "中文" ("Chinese"). On Linux, I wrote the following sample code modeled on the author's example:

#include <cstdio>
#include <cstdlib>
#include <clocale>
#include <cwchar>

int main(int argc, char *argv[])
{
    wchar_t wstr[] = L"中文";
    setlocale(LC_ALL, "zh_CN.UTF-8");
    wprintf(L"%s\n", wstr);

    return 0;
}

Note that the locale of my machine is "zh_CN.UTF-8".

However, the program's output surprised me:

whodare@whodare:$ ./a.out
-N

My first reaction was to suspect the author's sample code. After all, every call here is a standard C library function, so there should be no portability issue. However, when I tested the author's code on a Windows machine, the result left me rather depressed: everything worked.

Why does my program misbehave on Linux? Unconvinced, I searched with various keywords to see whether others had hit a similar problem. One search result gave me an idea: someone suggested that the problem lies in the conversion specifier passed to wprintf, and that replacing %s with %ls would fix it. Somewhat doubtful, I modified the program as follows. After compiling and running it, it really did work.

#include <cstdio>
#include <cstdlib>
#include <clocale>
#include <cwchar>

int main(int argc, char *argv[])
{
    wchar_t wstr[] = L"中文";
    setlocale(LC_ALL, "zh_CN.UTF-8");
    wprintf(L"%s\n", wstr);
    wprintf(L"%ls\n", wstr);

    return 0;
}

Output of the code above:

whodare@whodare:$ ./a.out
-N
中文

The problem was solved, but I was still puzzled: what is the difference between the conversion specifiers "%ls" and "%s"? Why did the original program go wrong? Where did the string "-N" come from? And why does the same program show no problem on Windows?

With so many questions stuck in my mind, how could I rest easy? One should know not only that something works, but why! I spent an afternoon carefully reading the wprintf manual and running various tests with the help of GDB, and finally resolved all my doubts.

I. All of the following experiments use "中文" as the example, so it helps to first list its Unicode code points and UTF-8 encodings:

'中'  Unicode code point: U+4E2D  UTF-8 encoding: E4 B8 AD
'文'  Unicode code point: U+6587  UTF-8 encoding: E6 96 87

II. We need to understand the difference between storing "中文" in a char[] and in a wchar_t[]

char str[] = "中文";
wchar_t wstr[] = L"中文";

We use the powerful tool GDB to check which values are stored in str[] and wstr[]:

(gdb) x/8xb &str
0xbf83decd: 0xe4 0xb8 0xad 0xe6 0x96 0x87 0x00 0xf0
(gdb) x/12xb &wstr
0xbf83dec0: 0x2d 0x4e 0x00 0x00 0x87 0x65 0x00 0x00
0xbf83dec8: 0x00 0x00 0x00 0x00

It is easy to see that char str[] holds the UTF-8 encoding of "中文". My machine's locale is zh_CN.UTF-8, so the source file is naturally saved in UTF-8; when the compiler processes char str[] = "中文";, the initialization of str[] can therefore be understood as char str[] = {0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87, 0x00};

wchar_t wstr[] holds the Unicode code points of "中文" (stored here in little-endian byte order), which matches the definition of wide characters in the C standard. One point needs explaining: the C standard does not fix the width of wchar_t, and on some platforms (Windows, for example) it is 16 bits. Starting with GNU glibc 2.2, however, wchar_t is a 32-bit type used only to store ISO 10646 code values (you can roughly think of ISO 10646 as Unicode, although the two are not identical), independent of the locale currently in use. That is why each code value in the output above occupies 32 bits rather than 16.

III. Differences between %s and %ls

I found a page (sadly, in the CS field the most reliable information always seems to be in English) that explains the various conversion specifiers in detail. Those willing to read the original can skip the text below.

http://www-ccs.ucsd.edu/c/lib_prin.html

First, the difference between %ls and %s is simple: %ls means the corresponding argument is treated as a wide character string (WCS), while %s means the corresponding argument is treated as an ordinary multibyte string (MBS).

Second, do not conclude from the sentence above that %s is only for printf and %ls only for wprintf. In fact, the choices (printf, wprintf) and (%s, %ls) are independent of each other; all four combinations are acceptable.

Third, printf writes to a byte stream, in which each unit of output is a single byte, while wprintf writes to a wide stream, in which each unit of output is a wide character.

Let's look at the difference between %ls and %s through examples.

Example 1: printf + %s + wstr

printf("%s", wstr);

whodare@whodare:$ ./a.out
-N

Ha, that depressing "-N" again! Why? Let us analyze what printf does when it runs.

With %s, printf treats the corresponding argument wstr as an ordinary string (although we know it is a WCS rather than an MBS). We have already seen the memory layout of wstr[]: its first three bytes are 0x2d, 0x4e, 0x00. A C string ends at '\0', so printf processes only the first three bytes of wstr[]. In the ASCII table, 0x2d is the character '-' and 0x4e is the character 'N', which gives the strange output "-N".

Example 2: printf + %ls + wstr

printf("%ls", wstr);

whodare@whodare:$ ./a.out
中文

With %ls, printf treats the corresponding argument as a wide string (WCS). Since printf writes to a byte stream, the wide string must be converted to an ordinary multibyte string (MBS). printf does this by implicitly calling the standard library function wcrtomb() for each wide character. What rules does wcrtomb() follow? This is where setlocale() comes in: wcrtomb() converts the code value stored in each wchar_t to the multibyte encoding of the locale the programmer has set.

Back to the example: my machine's locale is zh_CN.UTF-8, whose encoding is UTF-8. The Unicode code values stored in wstr[] are therefore converted to UTF-8 and written to standard output, and a console that uses UTF-8 correctly interprets the byte stream and displays "中文".

Example 3: wprintf + %s + wstr (the original code!)

wprintf(L"%s", wstr);

whodare@whodare:$ ./a.out
-N

With %s, wprintf treats the corresponding argument as an ordinary multibyte string (MBS), although we know it is really a WCS. wprintf writes to a wide stream, so the supposed MBS must first be converted to a WCS before wprintf can output it. wprintf performs this conversion by implicitly calling mbrtowc() on the MBS, and the conversion rules again depend on the locale.

We know that the memory layout of wstr is:
0x2d 0x4e 0x00 0x00 0x87 0x65 0x00 0x00
0x00 0x00 0x00 0x00

Read as an "MBS", these bytes convert to L'\x2d' + L'\x4e' + L'\0', so the final output is the annoying "-N".

Example 4: wprintf + %ls + wstr

wprintf(L"%ls", wstr);

whodare@whodare:$ ./a.out
中文

With %ls, wprintf treats the corresponding argument as a wide string (WCS); this time we finally got it right. wprintf writes the wide string to standard output without trouble, and "中文" is displayed correctly.

After these four examples, are you still confused about how to use wprintf, printf, %ls, and %s?

IV. Summary

1. %ls and %s specify the type of string expected as the argument; printf and wprintf differ in the type of stream they write to.

2. The correct way to output "中文" on Linux appears to be wprintf(L"%ls\n", L"中文"). The call wprintf(L"%s\n", L"中文"), which worked for the cited author on Windows, does not work properly on Linux. As for why the standard library function wprintf behaves differently on the two systems, I have no intention of digging deeper. Is it yet another place where VC does not comply with the standard?

3. There also seems to be a %S conversion, which on its own indicates that the corresponding argument is a wide string.

I would be very grateful to anyone who can tell me the answer to this question.

This article comes from the CSDN blog. When reprinting, please cite the source: http://blog.csdn.net/code_robot/archive/2010/06/22/5686176.aspx
