Over the past two days I ran into the following problems with wide characters:
- Question 1: Why must setlocale(LC_ALL, "CHS") be called before wprintf can output a Unicode-encoded string, while a multi-byte string printed with printf is output correctly whether or not setlocale is called? For example:
void main() { wchar_t *wstr = L""; setlocale(LC_ALL, "CHS"); wprintf(L"%ls", wstr); }
Output:
void main() { char *wstr = ""; setlocale(LC_ALL, "CHS"); /* optional */ printf("%s", wstr); }
Output:
- Question 2: What does %ls do? Compare the following four small programs in VS2008, which combine printf and wprintf with %s and %ls to output a Unicode string. To keep the comparison focused, each program calls setlocale(LC_ALL, "CHS").
void main() { wchar_t *wstr = L""; setlocale(LC_ALL, "CHS"); printf("%s", wstr); }
Output:
void main() { wchar_t *wstr = L""; setlocale(LC_ALL, "CHS"); printf("%ls", wstr); }
Output:
void main() { wchar_t *wstr = L""; setlocale(LC_ALL, "CHS"); wprintf(L"%s", wstr); }
Output:
void main() { wchar_t *wstr = L""; setlocale(LC_ALL, "CHS"); wprintf(L"%ls", wstr); }
Output:
According to blog posts I had read earlier, %ls tells the function to interpret the corresponding argument as a Unicode (wide) string, yet for wprintf there appeared to be no difference between %s and %ls.
I then spent a morning searching for information and have now cleared up most of my confusion. Below I have organized what I collected; for the quoted parts I link directly to the original sources where I could find them.
On character encoding and the relationship between C and Unicode: the article at http://blog.csdn.net/bigwhite20xx/article/details/1864908 explains the mismatch between the two, namely that standard C has no built-in support for Unicode. That is the key to the problems I ran into above. With that in mind, consider this forum reply (original post: http://bbs.chinaunix.net/thread-3693579-1-1.html):
The problem involves three things: the encoding in which a character is stored in the source code, the encoding in which it is stored in the compiled binary, and the encoding in which it is finally output.
For an ordinary (narrow) string, the three are the same. For a wide string, they may differ.
Take Linux as an example. Linux normally uses UTF-8, so the source code is also saved as UTF-8. For an ordinary string the compiler performs no conversion and simply places those bytes in the binary. When printf outputs the string, the bytes are again written as-is. If the program receiving the output (perhaps a shell) supports UTF-8, the text displays correctly; if not, it is garbled.
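To make the "bytes pass through unchanged" point concrete, here is a minimal sketch that prints the bytes of a narrow string as hex. The helper name dump_bytes is hypothetical (it is not from the quoted reply), and the example spells the UTF-8 bytes out with escapes so that it does not depend on how the source file itself is saved.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of a narrow string into `out`
   as space-separated hex. Returns the number of bytes dumped. */
size_t dump_bytes(const char *s, char *out, size_t out_size)
{
    size_t len, used = 0, i;

    assert(s != NULL && out != NULL && out_size > 0);
    len = strlen(s);
    out[0] = '\0';
    for (i = 0; i < len; i++) {
        /* Worst case per byte: separator + two hex digits + terminator. */
        if (out_size - used < 4)
            break;
        used += (size_t)snprintf(out + used, out_size - used,
                                 i ? " %02X" : "%02X",
                                 (unsigned)(unsigned char)s[i]);
    }
    return i;
}
```

Calling dump_bytes("\xE4\xB8\xAD", buf, sizeof buf) fills buf with "E4 B8 AD", the three UTF-8 bytes of U+4E2D. These are exactly the bytes that were in the literal, which is what printf would hand to the shell.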
Now wide characters, again on Linux. The source code is still UTF-8, but during compilation the compiler converts the characters to Unicode and stores that form in the binary. The encoding of the output then depends on your locale setting. If the shell supports UTF-8 but the locale you set is GBK, the program converts Unicode to GBK when printing, while the shell interprets those bytes as UTF-8, so the result is of course garbled.
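The stored form of a wide literal can be inspected in the same spirit. The sketch below uses a hypothetical helper, dump_wide, that lists each wchar_t element of a wide string as U+XXXX. It assumes a compiler that accepts C99 universal character names (\uXXXX). Note that on Windows wchar_t is 16 bits, so characters outside the Basic Multilingual Plane would appear there as two elements.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: format each element of a wide string as "U+XXXX",
   space-separated. Returns the number of wchar_t elements formatted. */
size_t dump_wide(const wchar_t *ws, char *out, size_t out_size)
{
    char item[24];
    size_t used = 0, i;

    assert(ws != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    for (i = 0; ws[i] != L'\0'; i++) {
        size_t item_len, sep = (i > 0) ? 1 : 0;

        snprintf(item, sizeof item, "U+%04lX", (unsigned long)ws[i]);
        item_len = strlen(item);
        if (used + sep + item_len + 1 > out_size)
            break;                      /* no room left for this element */
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, item, item_len + 1);
        used += item_len;
    }
    return i;
}
```

For L"\u4e2d\u6587" this produces "U+4E2D U+6587": the binary holds code values, not the UTF-8 bytes of the source file, so some conversion has to happen before the text reaches the shell.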
With the above explanation, I believe question 1 is answered. A few additional notes:
- Setting the character set to Unicode in VS2008 (Project -> Properties -> Configuration Properties -> General) only means that the _UNICODE macro (along with UNICODE) is predefined for the compiler. The macro makes the header files select the matching functions and types at compile time. For example, when _UNICODE is defined, CString becomes CStringW, _tprintf becomes wprintf, and MessageBox becomes MessageBoxW; in fact, the DLLs provided by Windows export no function actually named MessageBox. Defining the macro does not make every string in the code Unicode-encoded. To get a Unicode-encoded string constant you must prefix it with L, as in L"Chinese ABC". As the forum reply above describes, every string in the source is saved in the character encoding used by the current system (in VS this appears to be CHS); only at compile time does the compiler convert string constants marked with L to Unicode, in the manner of mbstowcs(), and store them in the binary (the PE file).
- UTF-8, CHS (GB2312), and the like are all multi-byte encodings. Because so many encodings exist, output to a shell requires a wcstombs()-style conversion from Unicode to one particular multi-byte encoding, and the program cannot know on its own which encoding the current shell uses. That is why setlocale is required.
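The role of setlocale in the last tip can be sketched with the standard wcstombs() call. The helper names wide_to_locale and use_locale are hypothetical; the locale names a system accepts vary ("CHS" is the Windows name used in this post, while Linux systems typically use names such as zh_CN.GBK or en_US.UTF-8), so treat this as an illustration under those assumptions rather than what wprintf does internally.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: convert a wide string to the multi-byte encoding of
   the current locale. Returns the number of bytes written (excluding the
   terminator), or (size_t)-1 if a character cannot be represented. */
size_t wide_to_locale(const wchar_t *ws, char *out, size_t out_size)
{
    size_t n;

    assert(ws != NULL && out != NULL && out_size > 0);
    n = wcstombs(out, ws, out_size - 1);
    if (n == (size_t)-1) {
        out[0] = '\0';
        return n;
    }
    out[n] = '\0';   /* wcstombs does not terminate when the buffer fills up */
    return n;
}

/* Hypothetical helper: select the encoding used for the conversion. An
   empty name asks for the locale configured in the user's environment.
   Returns 1 on success, 0 if the locale name is not available. */
int use_locale(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}
```

A program starts in the "C" locale, where only basic characters are guaranteed to convert, so a Chinese wide string typically fails here until use_locale() has selected a locale whose encoding can represent it. The same wide string then yields different bytes under a GBK locale and a UTF-8 locale, which is why the locale must match what the shell expects.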
For more information about character encoding and setlocale, see:
http://blog.csdn.net/lovekatherine/article/details/1765903 UTF-8 & Unicode
http://blog.csdn.net/dawei_sun/article/details/3541351 About character encoding problems encountered in C++ standard I/O and file processing
As for question 2: it is not the case, as claimed on the web, that %s and %ls make no difference for wprintf. That holds only on Windows; on Linux, wprintf(L"%s", L"Chinese") does not produce the correct result. For details, see: http://blog.csdn.net/lovekatherine/article/details/1868724
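Since %s in the wprintf family is read differently on Windows and on Linux, one way to stay portable is to always write %ls for a wchar_t * argument, which both standard C and the Microsoft runtime treat as a wide string. The sketch below shows this with swprintf so the result can be checked in memory; format_wide is a hypothetical helper name, and the count-taking form of swprintf is assumed (C99 and later).

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: format a wide string into a wide buffer.
   %ls means "wchar_t *" regardless of platform, so this does not depend
   on how the runtime interprets a plain %s in a wide format string.
   Returns the number of wide characters written, or a negative value. */
int format_wide(wchar_t *out, size_t out_len, const wchar_t *ws)
{
    assert(out != NULL && ws != NULL);
    return swprintf(out, out_len, L"%ls", ws);
}
```

With L"%s" in place of L"%ls", a standards-conforming library such as the one on Linux would expect a char * argument instead, which matches the incorrect output described above.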