Unicode character output


Source: Elegant C++ (Emmett blog)


 

I have been studying Unicode for a few days and copied down everything useful I came across. The article is pieced together from those notes, so it looks a bit messy :).

 

1. wprintf
Q: What is sizeof(wchar_t)?
A: It varies with the compiler, so do not use wchar_t when cross-platform code is required. In VC, sizeof(wchar_t) = 2.

Q: Why does calling wprintf(L"测试1234") directly in VC print nothing?
A: The locale has not been set.

setlocale(LC_ALL, "chs");
wprintf(L"%s", L"测试1234");

Or (assuming the current active code page is CHS):

char scp[16];
int cp = GetACP();
sprintf(scp, ".%d", cp);
setlocale(LC_ALL, scp);
wprintf(L"测试1234");

2. wcout
The same applies, except that the locale is set through std::locale:

locale loc("chs");
wcout.imbue(loc);
wcout << L"测试1234" << endl;

 

This part appears to be the work of [netsin].
Note: wprintf is a C standard library function and wcout is part of the C++ standard library, although not every toolchain ships it (see the mingw32 supplement further down). In C++, L"......" is a wide-character string, but not necessarily a Unicode string; that depends on the compiler implementation.
[Qian Kun smile]: Why do C and C++ leave the meaning of L"xx" up to the implementation? Clearly for the sake of generality and portability. Bjarne's view is that the C++ way is to let programmers use any character set as the character type of a string. In addition, Unicode has already gone through several versions, and nobody knows whether it will remain suitable forever. For a detailed discussion of Unicode and a comparison with other character sets, I recommend reading "No-Nonsense XML".

The following two pieces of code were run on English Windows XP Professional and compiled with VS2005 RTM.

// C
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
    /* Any one of these locale names selects Simplified Chinese in VC. */
    setlocale(LC_ALL, "chs");
    // setlocale(LC_ALL, "chinese-simplified");
    // setlocale(LC_ALL, "ZHI");
    // setlocale(LC_ALL, ".936");
    wprintf(L"中国");

    return 0;
}

// C++
#include <iostream>
#include <locale>
using namespace std;
int main(void)
{
    // Any one of these locale names selects Simplified Chinese in VC.
    locale loc("chs");
    // locale loc("chinese-simplified");
    // locale loc("ZHI");
    // locale loc(".936");
    wcout.imbue(loc);
    std::wcout << L"中国" << endl;

    return 0;
}

Note: do not mix setlocale and std::locale in the same program.

-------------------------



-------------------------

"VC knowledge base" code: 56 43 D6 AA ca B6 BF E2 00 // ANSI code
L "VC knowledge base" encoded in VC ++: 56 00 43 00 E5 77 C6 8B 93 5E 00 00 // (Unicode in Windows) Encoding
L "VC knowledge base" encoded in GCC (Dev-CPP4990): 56 00 43 00 D6 00 AA 00 ca 00 B6 00 BF 00 E2 00 00 00 // simply add 0 to the ANSI Encoding
L "VC knowledge base" failed to compile in GCC (Dev-CPP4992), reported illegal byte sequence

L "VC knowledge base" solution steps in Dev-CPP4992:
A. Save the file as UTF-8 encoded // UTF-8 is one of Unicode, but it is different from (Unicode in Windows)
B. Remove the BOM header: Use a binary Editor (such as Vc) to remove the first three bytes of the UTF-8 file. // Linux/Unix does not use Bom.
C. Use gcc/g ++ for compiling and running

After the above steps, in the dev-cpp4992
"VC knowledge base" encoding: 56 43 E7 9f A5 E8 af 86 E5 Ba 93 00 // UTF-8 encoding. Note that it is no longer ANSI encoding. Therefore, use printf/cout to output garbled characters.
L "VC knowledge base" encoding: 56 00 43 00 E5 77 C6 8B 93 5E 00 00 00 // (Unicode in Windows) Encoding

Supplement: to use wcout and wstring in mingw32 you need to define some macros, for example:
#define _GLIBCXX_USE_WCHAR_T 1
#include <iostream>
int main(void)
{
    std::wcout << 1 << std::endl;
}
This compiles but does not link. Searching online, STLport says mingw32 is at fault, and mingw32 says Microsoft's C runtime is at fault.

 

 

Unicode output of printf and wprintf on the console


1. printf can only produce ANSI/MB output; it does not support writing a Unicode stream.
Example:
wchar_t test[] = L"测试1234";
printf("%s", test);
This is not output correctly.
Note: I think this should be printf("%S", test); otherwise printf stops as soon as it meets a single zero byte. In VC, %S in printf means a wide string that is converted from Unicode to MBCS. Even so, the conversion here does not produce correct output (WideCharToMultiByte() can), for the reason given in item 2: the locale has not been set.

2. wprintf does not provide Unicode output either.
It converts the wchar_t string to the SB/MB character encoding of the current locale and then outputs the result.
Example:
wchar_t test[] = L"测试1234";
wprintf(L"%s", test);
This outputs something like ??1234, or nothing at all, because wprintf cannot convert L"测试1234" to the default ANSI encoding. The locale has to be set first:
setlocale(LC_ALL, "chs");
wchar_t test[] = L"测试1234";
wprintf(L"%s", test);
Now the output is correct.
This is equivalent to printf("%ls", test);

To sum up: the CRT I/O functions do not provide Unicode output.


3. The Windows console has been a true Unicode console since NT4
However, Unicode strings can only be output through the Windows API function WriteConsoleW.
For example:

wchar_t test[] = L"测试1234";
DWORD ws;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), test, wcslen(test), &ws, NULL);

The output is correct without setting a locale. Because this is true Unicode output, it does not depend on the code page.

4. How to implement cross-platform console output
Do not use wchar_t and wprintf, because they depend on the compiler.
ICU, from IBM, is a mature cross-platform library with Unicode support.

The following is ICU's uprintf implementation:

void uprintf(const UnicodeString &str) {
    char *buf = 0;
    int32_t len = str.length();
    int32_t bufLen = len + 16;
    int32_t actualLen;
    buf = new char[bufLen + 1];
    actualLen = str.extract(0, len, buf/*, bufLen*/); // default code page conversion
    buf[actualLen] = 0;
    printf("%s", buf);
    delete[] buf; // buf came from new[], so it must be released with delete[]
}

It first converts the Unicode string to the local code page and then calls printf. Although this is not Unicode output, it works well across platforms.

Postscript:
Mbstowcs (wchar_t * wcstr
, Const char * mbstr
, Size_t count
).
Count: the maximum number of multibyte characters to convert.
The number of characters in the Multi-byte string to be converted plus one to the number of characters in the active locale.
For example, the string "ABC Zhao 123" for C locale is strlen ("ABC Zhao 123"), that is, 8 + 1.
For chinese-simplified.936, count is 7 + 1.
The Count calculation must be in the same locale as mbstowcs.


Reference address: http://blog.programfan.com/trackback.asp?id=18802


 

Article comment
  • Comment by: Star week, time: 10:40:00
  •
    char scp[16];
    int cp = GetACP();
    sprintf(scp, ".%d", cp);
    setlocale(LC_ALL, scp);
    wprintf(L"测试1234");
    is equivalent to
    setlocale(LC_ALL, "");
    wprintf(L"测试1234");
