Unicode character output

Last Update:2018-12-05 Source: Internet

Author: User

Tags knowledge base


Source: Elegant C++ (Emmett blog)

I've been studying Unicode for a few days and have copied down everything I came across. This article is pieced together from those notes, so it looks a bit messy :).

1. wprintf
Q: What is sizeof(wchar_t)?
A: It varies with the compiler, so avoid wchar_t when cross-platform support is required. In VC, sizeof(wchar_t) == 2.
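
The compiler dependence can be checked directly. The following is a minimal sketch; the helper names wchar_bits and likely_wide_encoding are hypothetical, and the mapping from width to encoding is only a common convention, not something the language guarantees.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in one wchar_t for the compiler that builds this file.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Guess an encoding from the width. This is a convention only:
// 16-bit wchar_t usually holds UTF-16 units, 32-bit usually UTF-32.
constexpr const char* likely_wide_encoding() {
    return wchar_bits() == 16 ? "UTF-16"
         : wchar_bits() == 32 ? "UTF-32"
         : "unknown";
}
```

VC reports 16 bits here, while GCC on Linux typically reports 32, which is why code that assumes one width does not port cleanly.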

Q: Why does calling wprintf(L"测试1234") directly in VC print nothing? (测试 is the Chinese word for "test".)
A: Because the locale has not been set.

setlocale(LC_ALL, "chs");
wprintf(L"%s", L"测试1234");

Or (assuming that the current active code page is CHS):

char scp[16];
int cp = GetACP();
sprintf(scp, ".%d", cp);
setlocale(LC_ALL, scp);
wprintf(L"测试1234");
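
Locale names such as "chs" or ".936" are specific to the Microsoft runtime, and setlocale returns a null pointer when it does not recognize a name. A hedged sketch of a fallback chain follows; select_locale is a hypothetical helper, not part of any library.

```cpp
#include <clocale>
#include <string>

// Try the requested locale name first, then the environment default (""),
// then the always-available "C" locale. Returns the name that took effect.
std::string select_locale(const char* preferred) {
    const char* candidates[] = { preferred, "", "C" };
    for (const char* name : candidates) {
        if (name == nullptr) continue;
        if (const char* applied = std::setlocale(LC_ALL, name)) {
            return applied;
        }
    }
    return std::string();  // not reached in practice: "C" always succeeds
}
```

Calling select_locale("chs") before wprintf keeps the VC behaviour described above and degrades to the user's default locale elsewhere.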

2. wcout
The same applies here, except that the locale is set through std::locale:

locale loc("chs");
wcout.imbue(loc);
wcout << L"测试1234" << endl;

This article should be credited to [netsin].

Note: wprintf is a standard C library function, and std::wcout is likewise a standard member of the C++ library. However, L"..." in C++ is a wide-character literal, not necessarily a Unicode one; its encoding depends on the compiler implementation.

[Qian Kun smile]: Why do C and C++ leave the encoding of L"XX" implementation-defined? Evidently for the sake of generality and portability. Bjarne holds that the C++ approach is to let programmers use any character set as the character type of a string. In addition, Unicode has gone through several versions, and it is not clear whether it will remain in use permanently. For more about Unicode and how it compares with other character sets, I recommend reading "No-Nonsense XML".

Both of the following programs were run on the English edition of Windows XP Professional and compiled with VS2005 RTM.

// C
#include <stdio.h>
#include <locale.h>
#include <wchar.h> /* declares wprintf */
int main(void)
{
    setlocale(LC_ALL, "chs");
    // setlocale(LC_ALL, "chinese-simplified");
    // setlocale(LC_ALL, "ZHI");
    // setlocale(LC_ALL, ".936");
    wprintf(L"中国");

    return 0;
}

// C++
#include <iostream>
#include <locale>
using namespace std;
int main(void)
{
    locale loc("chs");
    // locale loc("chinese-simplified");
    // locale loc("ZHI");
    // locale loc(".936");
    wcout.imbue(loc);
    std::wcout << L"中国" << endl;

    return 0;
}

Note: Do not mix setlocale and std::locale.

-------------------------

-------------------------

"VC knowledge base" code: 56 43 D6 AA ca B6 BF E2 00 // ANSI code
L "VC knowledge base" encoded in VC ++: 56 00 43 00 E5 77 C6 8B 93 5E 00 00 // (Unicode in Windows) Encoding
L "VC knowledge base" encoded in GCC (Dev-CPP4990): 56 00 43 00 D6 00 AA 00 ca 00 B6 00 BF 00 E2 00 00 00 // simply add 0 to the ANSI Encoding
L "VC knowledge base" failed to compile in GCC (Dev-CPP4992), reported illegal byte sequence

L "VC knowledge base" solution steps in Dev-CPP4992:
A. Save the file as UTF-8 encoded // UTF-8 is one of Unicode, but it is different from (Unicode in Windows)
B. Remove the BOM header: Use a binary Editor (such as Vc) to remove the first three bytes of the UTF-8 file. // Linux/Unix does not use Bom.
C. Use gcc/g ++ for compiling and running

After the above steps, in the dev-cpp4992
"VC knowledge base" encoding: 56 43 E7 9f A5 E8 af 86 E5 Ba 93 00 // UTF-8 encoding. Note that it is no longer ANSI encoding. Therefore, use printf/cout to output garbled characters.
L "VC knowledge base" encoding: 56 00 43 00 E5 77 C6 8B 93 5E 00 00 00 // (Unicode in Windows) Encoding
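
The UTF-16 and UTF-8 byte listings above can be reproduced without depending on the source file's encoding. The sketch below assumes the literal is V and C followed by the code points U+77E5, U+8BC6 and U+5E93, which is what the UTF-16 bytes E5 77, C6 8B and 93 5E decode to. The helper names are hypothetical, and the UTF-8 encoder deliberately handles only code points below U+10000.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Format a byte sequence as upper-case hex pairs separated by spaces.
std::string hex_bytes(const std::string& bytes) {
    std::string out;
    char pair[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        std::snprintf(pair, sizeof pair, "%02X",
                      static_cast<unsigned char>(bytes[i]));
        if (i != 0) out += ' ';
        out += pair;
    }
    return out;
}

// Serialize UTF-16 code units low byte first, regardless of host byte order.
std::string utf16le_bytes(const std::u16string& text) {
    std::string out;
    for (char16_t unit : text) {
        out += static_cast<char>(unit & 0xFF);
        out += static_cast<char>((unit >> 8) & 0xFF);
    }
    return out;
}

// Encode code points below U+10000 as UTF-8 (surrogate pairs are not handled).
std::string utf8_bytes(const std::u16string& text) {
    std::string out;
    for (char16_t unit : text) {
        if (unit < 0x80) {
            out += static_cast<char>(unit);
        } else if (unit < 0x800) {
            out += static_cast<char>(0xC0 | (unit >> 6));
            out += static_cast<char>(0x80 | (unit & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (unit >> 12));
            out += static_cast<char>(0x80 | ((unit >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (unit & 0x3F));
        }
    }
    return out;
}
```

For example, hex_bytes(utf16le_bytes(u"VC\u77E5\u8BC6\u5E93")) gives the VC++ listing without its terminator, and the utf8_bytes variant gives the UTF-8 listing.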

Supplement: to use wcout and wstring in mingw32, some macros have to be defined, for example:
#define _GLIBCXX_USE_WCHAR_T 1
#include <iostream>
int main(void)
{
    std::wcout << 1 << std::endl;
}
This compiles but fails to link. Searching online turns up finger-pointing: STLport says mingw32 is at fault, and mingw32 says Microsoft's C runtime is at fault.

Unicode output of printf and wprintf on the console

1. printf can only produce ANSI/MB output; it does not support writing a Unicode stream.
Example:
wchar_t test[] = L"测试1234";
printf("%s", test);
Note: I think this should be printf("%S", test); otherwise the output stops as soon as printf meets a single zero byte. %S requests a Unicode-to-MBCS conversion. Even so, the conversion here does not produce correct output (WideCharToMultiByte() does), for the reasons given in the next item.
Either way, the string is not output correctly.
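
The early termination mentioned in the note can be made visible by measuring a wide string the way a byte-oriented "%s" would scan it. This is an illustrative sketch only; bytes_seen_by_narrow_s is a hypothetical helper, and the exact count depends on byte order and on sizeof(wchar_t).

```cpp
#include <cstddef>
#include <cstring>

// Length that a byte-oriented "%s" would see if handed a wide string:
// it scans bytes until it reaches the first zero byte.
std::size_t bytes_seen_by_narrow_s(const wchar_t* wide) {
    return std::strlen(reinterpret_cast<const char*>(wide));
}
```

For an ASCII wide string such as L"1234", every character has a zero high byte, so at most one byte is seen before the scan stops.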

2. wprintf does not provide Unicode output either.
Instead, it converts the wchar_t string to the single-byte/multibyte encoding of the current locale and outputs the result.
Example:
wchar_t test[] = L"测试1234";
wprintf(L"%s", test);
This prints ??1234 or nothing at all, because wprintf cannot convert L"测试" to the encoding of the default "C" locale. The locale has to be set first:
setlocale(LC_ALL, "chs");
wchar_t test[] = L"测试1234";
wprintf(L"%s", test);
Now the output is correct.
In VC this is equivalent to printf("%ls", test);
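
Whether a wide string survives this locale conversion can be tested before printing. The sketch below uses the standard wcsrtombs in measuring mode (null destination); representable_in_current_locale is a hypothetical helper name.

```cpp
#include <clocale>
#include <cstddef>
#include <cwchar>

// True if every character of `wide` can be converted to the multibyte
// encoding of the current C locale. With a null destination, wcsrtombs only
// measures, and it returns (size_t)-1 when a character has no representation.
bool representable_in_current_locale(const wchar_t* wide) {
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide;
    return std::wcsrtombs(nullptr, &src, 0, &state)
           != static_cast<std::size_t>(-1);
}
```

In the default "C" locale, ASCII text passes this check while Chinese characters do not, which matches the ?? or empty output described above.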

To sum up: the CRT I/O functions do not provide Unicode output.

3. Since NT4, the Windows console has been a true Unicode console.
However, Unicode strings can only be output through the Windows API WriteConsoleW.
For example:

wchar_t test[] = L"测试1234";
DWORD ws;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), test, (DWORD)wcslen(test), &ws, NULL);

This prints correctly without setting a locale. Because it is real Unicode output, it is independent of the code page.
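
WriteConsoleW fails when standard output is redirected to a file or pipe, so a practical wrapper usually checks for a real console first. The following is a hedged sketch: write_wide is a hypothetical helper, the Windows branch uses GetStdHandle, GetConsoleMode and WriteConsoleW, and the fallback is the locale-dependent CRT path discussed earlier.

```cpp
#include <cstdio>
#include <cwchar>
#ifdef _WIN32
#include <windows.h>
#endif

// Write a wide string to standard output. On Windows, use WriteConsoleW when
// stdout is a real console; otherwise fall back to the CRT conversion.
bool write_wide(const wchar_t* text) {
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(out, &mode)) {  // succeeds only for a console handle
        DWORD written = 0;
        return WriteConsoleW(out, text,
                             static_cast<DWORD>(std::wcslen(text)),
                             &written, nullptr) != 0;
    }
#endif
    // Redirected output or a non-Windows system: printf converts the wide
    // string using the current locale and reports failure with a negative value.
    return std::printf("%ls", text) >= 0;
}
```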

4. How to implement cross-platform console output
Do not use wchar_t and wprintf, because they depend on the compiler.
ICU, from IBM, is a mature cross-platform library with Unicode support.

The following is ICU's uprintf implementation:

void uprintf(const UnicodeString &str) {
    char *buf = 0;
    int32_t len = str.length();
    int32_t bufLen = len + 16;
    int32_t actualLen;
    buf = new char[bufLen + 1];
    actualLen = str.extract(0, len, buf /*, bufLen */); // default code page conversion
    buf[actualLen] = 0;
    printf("%s", buf);
    delete[] buf; // array form, to match new char[]
}

It first converts the Unicode string to the local code page and then calls printf. Although this is not Unicode output, it works well across platforms.
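
The same convert-then-print idea can be sketched with the standard library alone, for cases where ICU is not available. This is not ICU's code: to_locale_bytes is a hypothetical helper that measures the required size with wcsrtombs instead of guessing a buffer length, and it inherits the wchar_t portability caveats discussed above.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Convert a wide string to the multibyte encoding of the current C locale.
// Throws if a character cannot be represented in that encoding.
std::string to_locale_bytes(const std::wstring& wide) {
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide.c_str();
    // First pass: a null destination makes wcsrtombs report the byte count.
    std::size_t needed = std::wcsrtombs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("character not representable in current locale");
    }
    std::string bytes(needed, '\0');
    if (needed != 0) {
        // Second pass: convert into the exactly sized buffer.
        state = std::mbstate_t();
        src = wide.c_str();
        std::wcsrtombs(&bytes[0], &src, needed, &state);
    }
    return bytes;
}
```

The result can then be passed to printf("%s", ...) after setlocale has selected a locale that can represent the text.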

Postscript:
Mbstowcs (wchar_t * wcstr
, Const char * mbstr
, Size_t count
).
Count: the maximum number of multibyte characters to convert.
The number of characters in the Multi-byte string to be converted plus one to the number of characters in the active locale.
For example, the string "ABC Zhao 123" for C locale is strlen ("ABC Zhao 123"), that is, 8 + 1.
For chinese-simplified.936, count is 7 + 1.
The Count calculation must be in the same locale as mbstowcs.
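
Rather than counting by hand, the count can be obtained from the runtime under whatever locale is active. A minimal sketch using the standard mbsrtowcs in measuring mode; wide_char_count and to_wide are hypothetical helper names.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cwchar>
#include <string>

// Number of wide characters the multibyte string converts to under the
// current locale, or (size_t)-1 if it contains an invalid sequence.
std::size_t wide_char_count(const char* mb) {
    std::mbstate_t state = std::mbstate_t();
    return std::mbsrtowcs(nullptr, &mb, 0, &state);
}

// Convert using a buffer of exactly count = characters + 1 wide characters.
std::wstring to_wide(const char* mb) {
    std::size_t n = wide_char_count(mb);
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    std::wstring wide(n + 1, L'\0');  // one extra slot for the terminator
    std::mbstowcs(&wide[0], mb, n + 1);
    wide.resize(n);                   // drop the terminator slot
    return wide;
}
```

Because both helpers run in the same locale, the count always matches what mbstowcs will actually produce.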

Reference address: http://blog.programfan.com/trackback.asp?id=18802

Article comment

Comments: Star week    Time: 10:40:00
char scp[16]; int cp = GetACP(); sprintf(scp, ".%d", cp); setlocale(LC_ALL, scp); wprintf(L"测试1234");
is equivalent to
setlocale(LC_ALL, ""); wprintf(L"测试1234");
