Storage and output analysis of Windows platform characters

Last Update:2015-01-22 Source: Internet

Author: User

1. Introduction

(Written on 2011-07-30)
The two character sets most commonly used in the Windows NT family of operating systems are ANSI and Unicode. "ANSI" is a general term: the actual ANSI encoding differs by country or region. For example, on the Simplified Chinese version of Windows XP the ANSI encoding is GBK, while on the Japanese version it is Shift-JIS. The Unicode character set is also standardized as ISO/IEC 10646, the Universal Multiple-Octet Coded Character Set. The goal of Unicode is to give every character in the world a single, uniform code, so that a given character has the same code regardless of operating system or locale. Because Windows stores Unicode codes in units of two bytes, the byte order of each unit can differ between systems; the two possible orders are called big-endian and little-endian.

Storage mode

Take the two Chinese characters "中文" as an example. On the Simplified Chinese version of Windows XP, their ANSI/GBK and Unicode codes are:

Character	中	文
ANSI/GBK	0xD6D0	0xCEC4
Unicode	0x4E2D	0x6587

2. In-memory storage mode

In VC (Visual C++), there are two character types: char and wchar_t. A char string holds ANSI codes (GBK on this system), while a wchar_t string holds Unicode codes; wchar_t is what is usually called a wide character.

Define the following two strings:

    const char* str = "中文";
    const wchar_t* wcstr = L"中文";

In the debugger we can see how these two strings are stored in memory. After compiling, the pointer str points to address 0x004188CC, where "中文" is stored as D6 D0 CE C4, exactly the GBK codes of the two characters. So when a string is defined as char* in VC, its characters are compiled into ANSI/GBK codes, as shown in Figure 1.

Figure 1 ANSI encoding of "中文" in memory

The wchar_t pointer wcstr points to address 0x004188C4, as shown in Figure 2. The figure shows that in VC a wchar_t string is compiled into Unicode codes and stored in UTF-16 little-endian order.

Figure 2 Unicode encoding of "中文" in memory
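The memory layouts in Figures 1 and 2 can be reproduced without a debugger. The following is a minimal sketch, not the author's code: hex_dump is a hypothetical helper, the GBK bytes are written as explicit escapes so the result does not depend on the source file's encoding, and char16_t is used instead of wchar_t as an assumption so that the code units are 16 bits wide on every platform (wchar_t is 16 bits in VC but 32 bits on most Unix compilers).

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format `size` bytes starting at `data`
// as upper-case hex pairs separated by spaces, e.g. "D6 D0 CE C4".
std::string hex_dump(const void* data, std::size_t size) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < size; ++i) {
        std::snprintf(buf, sizeof buf, "%02X", p[i]);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// "中文" as GBK bytes, the form a char string takes on a
// Simplified Chinese system (written as escapes on purpose).
const char gbk_str[] = "\xD6\xD0\xCE\xC4";

// "中文" as UTF-16 code units (assumption: char16_t stands in for VC's wchar_t).
const char16_t utf16_str[] = u"\u4E2D\u6587";
```

Calling hex_dump(gbk_str, 4) yields the bytes of Figure 1. Calling hex_dump(utf16_str, 4) yields the bytes of Figure 2 on a little-endian machine such as x86; on a big-endian machine the two bytes of each unit would appear swapped.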

Unicode only specifies the code of each character; it does not specify how those codes are stored. There are three common storage modes:

1. UTF-16 little-endian: each 16-bit unit is stored in two bytes, low byte first, high byte second;

2. UTF-16 big-endian: each 16-bit unit is stored in two bytes, high byte first, low byte second;

3. UTF-8: each code is stored in 1 to 4 bytes according to fixed rules; common Chinese characters need 3 bytes.

In Figure 2, the bytes 2D 4E 87 65 are the Unicode codes 4E2D 6587 of the two characters "中文", stored in UTF-16 little-endian order.
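The three storage modes can be sketched in a few lines of C++. The function names encode_utf16 and encode_utf8 are hypothetical and chosen only for illustration; the sketch assumes the input is a valid Unicode code point and does no error checking.

```cpp
#include <cstdint>
#include <vector>

// Encode one code point as UTF-16 bytes in the requested byte order.
// Code points below 0x10000 take one 16-bit unit; larger ones take
// a surrogate pair (two units, four bytes).
std::vector<std::uint8_t> encode_utf16(char32_t cp, bool little_endian) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
    }
    std::vector<std::uint8_t> bytes;
    for (std::uint16_t u : units) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (little_endian) { bytes.push_back(lo); bytes.push_back(hi); }
        else               { bytes.push_back(hi); bytes.push_back(lo); }
    }
    return bytes;
}

// Encode one code point as UTF-8 (1 to 4 bytes).
std::vector<std::uint8_t> encode_utf8(char32_t cp) {
    std::vector<std::uint8_t> b;
    if (cp < 0x80) {
        b.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        b.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        b.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        b.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        b.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        b.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        b.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        b.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        b.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        b.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return b;
}
```

For the character 中 (code 0x4E2D), encode_utf16(0x4E2D, true) gives 2D 4E, encode_utf16(0x4E2D, false) gives 4E 2D, and encode_utf8(0x4E2D) gives E4 B8 AD.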

3. How to store on disk

Open Windows Notepad, type "中文", and save it four times, choosing ANSI, Unicode, Unicode big endian, and UTF-8 in turn from the Encoding drop-down box of the Save As dialog. Opening the four files in a hexadecimal editor gives the contents listed in Table 1.

Table 1

Encoding	Hexadecimal content
ANSI	D6 D0 CE C4
Unicode	FF FE 2D 4E 87 65
Unicode Big Endian	FE FF 4E 2D 65 87
UTF-8	EF BB BF E4 B8 AD E6 96 87

As the table shows, when ANSI is selected the text is stored directly in the system default encoding, high byte first, with no identifying mark in front. On the Simplified Chinese version of Windows XP the system default encoding is GBK, so the file contains the GBK codes of "中文": D6 D0 CE C4.

When one of the Unicode encodings is selected, Notepad writes the character U+FEFF, "ZERO WIDTH NO-BREAK SPACE", at the beginning of the file. Used this way it is known as the byte order mark (BOM), and its purpose is to identify how the file stores its Unicode codes. The Unicode codes of "中文" are 4E2D 6587, and they are commonly stored in the three ways listed in Table 2.

Table 2

Storage mode	Encoded string content
UTF-16 Little Endian (small end)	2D4E 8765
UTF-16 Big Endian (big end)	4E2D 6587
UTF-8	E4B8AD E69687

Note: UTF-8 is variable length: an ASCII letter takes one byte and a common Chinese character takes three. UTF-16 uses two bytes for every character in the Basic Multilingual Plane, whether it is a letter or a Chinese character (and four bytes for the rarer characters outside that plane), so storing mostly Latin text in UTF-16 wastes space.

The identifying marks that correspond to these three storage modes are:

Table 3

Storage mode	Corresponding mark (BOM)
UTF-16 Little Endian (small end)	FF FE
UTF-16 Big Endian (big end)	FE FF
UTF-8	EF BB BF

So, from Table 1:

1. If you choose "Unicode", the string is converted to Unicode codes and stored in UTF-16 little-endian order;

2. If you choose "Unicode big endian", the string is converted to Unicode codes and stored in UTF-16 big-endian order;

3. If you choose "UTF-8", the string is converted to Unicode codes and stored as UTF-8.

From the above we can also see that if the first two bytes of a UTF-16 text file are FF FE, the file stores its Unicode codes in little-endian order. The third byte is the low byte of the first character's Unicode code and the fourth byte is its high byte; combining the two gives the first Unicode character. Likewise, the fifth byte is the low byte of the second character's Unicode code and the sixth byte is its high byte.
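The byte-by-byte reasoning above can be written down as a small sketch. The names Bom, detect_bom and read_utf16_units are hypothetical; the sketch only recognizes the three marks from Table 3 and ignores any trailing odd byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Identify the storage mode from the first bytes of a file (Table 3).
Bom detect_bom(const std::vector<std::uint8_t>& d) {
    if (d.size() >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF) return Bom::Utf8;
    if (d.size() >= 2 && d[0] == 0xFF && d[1] == 0xFE) return Bom::Utf16LE;
    if (d.size() >= 2 && d[0] == 0xFE && d[1] == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}

// Rebuild the 16-bit Unicode codes that follow a UTF-16 BOM.
// Returns an empty vector if the data does not start with a UTF-16 BOM.
std::vector<std::uint16_t> read_utf16_units(const std::vector<std::uint8_t>& d) {
    std::vector<std::uint16_t> units;
    const Bom bom = detect_bom(d);
    if (bom != Bom::Utf16LE && bom != Bom::Utf16BE) return units;
    for (std::size_t i = 2; i + 1 < d.size(); i += 2) {
        const unsigned first = d[i];
        const unsigned second = d[i + 1];
        // Little-endian: low byte comes first. Big-endian: high byte comes first.
        const unsigned value = (bom == Bom::Utf16LE) ? ((second << 8) | first)
                                                     : ((first << 8) | second);
        units.push_back(static_cast<std::uint16_t>(value));
    }
    return units;
}
```

Applied to the "Unicode" file from Table 1 (FF FE 2D 4E 87 65), read_utf16_units returns the codes 0x4E2D and 0x6587.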

A UTF-8 sequence is the Unicode code of a character converted into 1 to 4 bytes according to fixed rules, so the Unicode code can also be derived back from the UTF-8 bytes. Please refer to other documents for the exact rules.
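As a rough illustration of deriving Unicode codes from UTF-8 bytes, here is a minimal sketch. The function name decode_utf8 is hypothetical. It checks only the lead-byte pattern and the 10xxxxxx form of continuation bytes, emitting U+FFFD for anything else; it does not reject overlong forms or surrogate values as a full decoder would.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode a UTF-8 byte sequence into Unicode code points.
std::vector<char32_t> decode_utf8(const std::vector<std::uint8_t>& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const std::uint8_t b = bytes[i];
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        if (i + extra >= bytes.size()) {  // sequence is cut off at the end
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const std::uint8_t c = bytes[i + k];
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append the low 6 bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Feeding it the UTF-8 file from Table 1 (EF BB BF E4 B8 AD E6 96 87) returns 0xFEFF, 0x4E2D and 0x6587: the byte order mark followed by the two characters.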

4. How characters are output

Now that we know how characters are encoded and stored in the computer, how are they output?

4.1 How the Windows console outputs characters

Windows maintains a console output buffer internally and renders the contents of that buffer in the console window using the default character encoding. To output a string to the console, it is enough to copy the memory region that holds the string into the console buffer. On the Simplified Chinese version of Windows XP the default character encoding is GBK, so Windows interprets the contents of the console buffer as GBK codes. For the console window to display the buffer correctly, the bytes copied into the console buffer must therefore be GBK codes.

4.2 Outputting strings to the console in C and C++

The printf() function in C and the std::cout object in C++ ultimately call a system function in kernel32.dll, such as WriteConsole(), to copy the memory region that holds the string into the console buffer.

For char* strings, the output function provided by C is printf(); for wchar_t* strings, it is wprintf().

In VC, char* strings are compiled into ANSI (GBK) codes, exactly the encoding the console buffer expects, so they can be output directly. wchar_t* strings, however, are compiled into Unicode codes. If the program copied the string's memory directly into the output buffer at run time, the string's encoding would not match the console's default encoding: the console would interpret the Unicode codes as GBK codes and display garbled text.

One possible solution is to convert the Unicode codes to GBK codes first and then copy the result to the console output buffer, which avoids the garbled output.
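One way to do that conversion explicitly on Windows is the Win32 function WideCharToMultiByte(). The sketch below is an illustration, not the author's code: wide_to_codepage is a hypothetical helper name, code page 936 is assumed to be the GBK code page, and the non-Windows branch exists only so the file still compiles elsewhere (it returns an empty string and performs no conversion).

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert a UTF-16 wide string into bytes of the
// given Windows code page (936 is assumed here to mean GBK).
std::string wide_to_codepage(const std::wstring& wide, unsigned code_page = 936) {
#ifdef _WIN32
    if (wide.empty()) return std::string();
    // First call: ask how many bytes the converted string needs.
    const int len = WideCharToMultiByte(code_page, 0, wide.data(),
                                        static_cast<int>(wide.size()),
                                        nullptr, 0, nullptr, nullptr);
    if (len <= 0) return std::string();
    std::string out(static_cast<std::size_t>(len), '\0');
    // Second call: perform the conversion into the buffer.
    WideCharToMultiByte(code_page, 0, wide.data(),
                        static_cast<int>(wide.size()),
                        &out[0], len, nullptr, nullptr);
    return out;
#else
    // Not on Windows: the Win32 API is unavailable, so nothing is converted.
    (void)wide;
    (void)code_page;
    return std::string();
#endif
}
```

On Windows, printf("%s\n", wide_to_codepage(L"\u4E2D\u6587").c_str()) would then send GBK bytes to the console, which displays correctly only if the console's code page is also GBK.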

Outputting char* and wchar_t* strings in C and C++:

    #include <clocale>   // setlocale
    #include <cstdio>    // printf
    #include <cwchar>    // wprintf
    #include <iostream>  // std::cout, std::wcout
    #include <locale>    // std::locale

    // C: output a char* string (ANSI/GBK)
    void Cprintchar(const char* str)
    {
        printf("%s\n", str);
    }

    // C: output a wchar_t* string (Unicode)
    void Cprintwchar(const wchar_t* wcstr)
    {
        // Tell the C runtime which encoding to use for the console buffer; needs <locale.h>.
        // "chs" is the VC locale name for Simplified Chinese.
        setlocale(LC_ALL, "chs");
        wprintf(L"%ls\n", wcstr);
    }

    // C++: output a char* string (ANSI/GBK)
    void Ccprintchar(const char* str)
    {
        std::cout << str << std::endl;
    }

    // C++: output a wchar_t* string (Unicode)
    void Ccprintwchar(const wchar_t* wcstr)
    {
        // Tell the stream which encoding to use for the console buffer; needs <locale>.
        std::wcout.imbue(std::locale("chs"));
        std::wcout << wcstr << std::endl;
    }

This article is an English version of an article originally published in Chinese on aliyun.com.
