Storage and Output Analysis of Characters on the Windows Platform

Source: Internet
Author: User

1. Introduction

The two most common character encodings on Windows NT operating systems are ANSI and Unicode. ANSI is a general term: the ANSI encoding differs by country or region. For example, on the Simplified Chinese version of Windows XP the ANSI encoding is GBK, and on the Japanese version it is JIS (Shift-JIS).

Unicode corresponds to the Universal Multiple-Octet Coded Character Set (UCS) of ISO/IEC 10646. It aims to provide a single, unified code for every character in the world, so the code of a given character is the same no matter which operating system stores it. Because a Unicode code generally occupies two or more bytes, the order in which those bytes are stored may differ between systems: big-endian or little-endian.

Storage Method

For example, on the Simplified Chinese version of Windows XP, the ANSI/GBK and Unicode codes of the Chinese characters 中 and 文 are:

Character   中        文
ANSI/GBK    0xD6D0    0xCEC4
Unicode     0x4E2D    0x6587

2. Memory storage mode

VC defines two character types: char and wchar_t. On the Simplified Chinese version of Windows, char strings use the ANSI/GBK encoding, while wchar_t strings use Unicode (16-bit units). wchar_t is the usual wide character type.

Define the following two strings:

    const char *str = "中文";
    const wchar_t *wcstr = L"中文";

Debugging shows how these two strings are stored in memory. After compilation, the pointer str points to address 0x004188CC, where the string "中文" is stored as the bytes d6 d0 ce c4. This is exactly the GBK code of "中文", so a string defined as char * in VC is encoded as ANSI/GBK, as shown in 1.
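The byte values a debugger shows can also be printed by the program itself. The following is a minimal sketch, not the author's code: `hex_dump` and `demo_hex_dump` are hypothetical helper names, and the GBK bytes are written out explicitly so the result does not depend on the compiler's source or execution character set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Return the bytes at `data` as lowercase hex pairs separated by spaces.
std::string hex_dump(const void* data, std::size_t size) {
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string out;
    char pair[3];  // two hex digits plus the terminating NUL
    for (std::size_t i = 0; i < size; ++i) {
        std::snprintf(pair, sizeof pair, "%02x", static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += pair;
    }
    return out;
}

// Usage: the GBK bytes of the two characters, spelled out byte by byte.
void demo_hex_dump() {
    const char gbk[] = "\xD6\xD0\xCE\xC4";
    assert(hex_dump(gbk, std::strlen(gbk)) == "d6 d0 ce c4");
}
```

Calling `hex_dump(str, strlen(str))` on the `str` defined above should give the same bytes when the compiler encodes narrow string literals as GBK, which is the setup the article assumes.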

Table 1

Encoding             Hexadecimal content
ANSI                 D6 D0 CE C4
Unicode              FF FE 2D 4E 87 65
Unicode big endian   FE FF 4E 2D 65 87
UTF-8                EF BB BF E4 B8 AD E6 96 87

The table shows that if you select ANSI encoding when saving the file, the system default encoding is used and the bytes of each character are stored in order, high byte first. On the Simplified Chinese version of Windows XP the system default encoding is GBK, so the content stored in the file is the GBK code of "中文": D6 D0 CE C4.

If a Unicode encoding is selected, a "zero width no-break space" character (U+FEFF, the byte order mark) is added at the beginning of the file as required. Its purpose is to identify how the file stores Unicode codes. The Unicode code of "中文" is 4E2D 6587, which is commonly stored in one of three ways:

Table 2

Storage method                    String encoding content
UTF-16 Little Endian              2D 4E 87 65
UTF-16 Big Endian                 4E 2D 65 87
UTF-8                             E4 B8 AD E6 96 87

Note: UTF-8 is variable length: an ASCII letter takes one byte and a common Chinese character takes three bytes. UTF-16 stores every character of the Basic Multilingual Plane, whether a letter or a Chinese character, in two bytes (characters outside that plane take four bytes as a surrogate pair), so storing letters in UTF-16 wastes space.
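The byte counts in the note can be checked with a small encoder. This is a sketch under stated assumptions rather than a production routine: `utf8_encode`, `utf16_encode`, and `demo_encoders` are hypothetical names, and the input is assumed to be a valid code point (at most U+10FFFF and not a surrogate), which the code does not verify.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode one code point as UTF-8 (1 to 4 bytes). Input is assumed valid.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16 and serialize it in the requested byte order.
std::vector<std::uint8_t> utf16_encode(char32_t cp, bool big_endian) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;  // split into a high and a low surrogate
        units.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
    std::vector<std::uint8_t> bytes;
    for (std::uint16_t u : units) {
        std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (big_endian) { bytes.push_back(hi); bytes.push_back(lo); }
        else            { bytes.push_back(lo); bytes.push_back(hi); }
    }
    return bytes;
}

// Usage: the values from Table 2 for the character U+4E2D.
void demo_encoders() {
    assert(utf8_encode(0x4E2D) == "\xE4\xB8\xAD");
    assert(utf8_encode(U'A').size() == 1);
    assert(utf16_encode(0x4E2D, false) == (std::vector<std::uint8_t>{0x2D, 0x4E}));
    assert(utf16_encode(0x4E2D, true) == (std::vector<std::uint8_t>{0x4E, 0x2D}));
}
```

A letter such as `A` comes out as one byte in UTF-8 but two bytes in UTF-16, which is the space cost the note describes.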

The three storage methods are identified by the following byte sequences at the start of the file:

Table 3

Storage method                    Corresponding ID
UTF-16 Little Endian              FF FE
UTF-16 Big Endian                 FE FF
UTF-8                             EF BB BF
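Checking a file's leading bytes against Table 3 can be sketched as follows. The enum `Bom` and the functions `detect_bom` and `demo_detect_bom` are hypothetical names introduced for illustration. The sketch only distinguishes the three identifiers in the table; it does not try to tell UTF-16 Little Endian apart from other encodings whose identifier also begins with FF FE.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Classify the first bytes of `data` according to Table 3.
Bom detect_bom(const std::string& data) {
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(data[i]); };
    if (data.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return Bom::Utf8;
    if (data.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Bom::Utf16LE;
    if (data.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;  // no identifier, for example a plain ANSI file
}

// Usage: the "Unicode" row of Table 1 and a plain GBK file.
void demo_detect_bom() {
    assert(detect_bom(std::string("\xFF\xFE\x2D\x4E\x87\x65", 6)) == Bom::Utf16LE);
    assert(detect_bom("\xD6\xD0\xCE\xC4") == Bom::None);
}
```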

So Table 1 shows that:

1. If you select "Unicode", the string is encoded as Unicode and stored as little-endian UTF-16;

2. If you select "Unicode big endian", the string is encoded as Unicode and stored as big-endian UTF-16;

3. If you select UTF-8, the string is encoded as Unicode and stored as UTF-8.

It also follows that if the first two bytes of a text file are FF FE, the file stores Unicode characters in little-endian order. The third byte is the low byte of the first character's Unicode code and the fourth byte is its high byte; together these two bytes give one Unicode character. The fifth byte is the low byte of the second character's code, and the sixth byte is its high byte.
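The low-byte/high-byte pairing described above can be sketched like this. `decode_utf16le` and `demo_decode_utf16le` are hypothetical names; the function returns 16-bit code units and leaves surrogate pairs uncombined, which is enough for the characters discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Turn the bytes of a little-endian UTF-16 file into 16-bit code units.
// A leading FF FE identifier is skipped; a trailing odd byte is ignored.
std::vector<std::uint16_t> decode_utf16le(const std::vector<std::uint8_t>& bytes) {
    std::size_t i = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) i = 2;
    std::vector<std::uint16_t> units;
    for (; i + 1 < bytes.size(); i += 2) {
        // Low byte comes first, high byte second.
        units.push_back(static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return units;
}

// Usage: the "Unicode" row of Table 1 yields the codes 4E2D and 6587.
void demo_decode_utf16le() {
    assert(decode_utf16le({0xFF, 0xFE, 0x2D, 0x4E, 0x87, 0x65}) ==
           (std::vector<std::uint16_t>{0x4E2D, 0x6587}));
}
```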

UTF-8 stores the Unicode code of a character in 1 to 4 bytes according to fixed rules, and the Unicode code can in turn be recovered from the UTF-8 bytes. The details are not covered here; please refer to other documents.
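As a rough illustration of recovering Unicode codes from UTF-8 bytes, here is a simplified decoder. `utf8_decode` and `demo_utf8_decode` are hypothetical names. The sketch checks lead and continuation byte patterns but, as a stated simplification, does not reject overlong forms or encoded surrogates, which a strict decoder would.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points.
std::vector<char32_t> utf8_decode(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        // The lead byte tells how many bytes the character occupies.
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > s.size()) throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Usage: the UTF-8 row of Table 2 decodes back to 4E2D and 6587.
void demo_utf8_decode() {
    assert(utf8_decode("\xE4\xB8\xAD\xE6\x96\x87") ==
           (std::vector<char32_t>{0x4E2D, 0x6587}));
}
```

Decoding the identifier EF BB BF the same way gives the code FEFF, the "zero width no-break space" character.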

4. Character output mode

Now that we know how characters are encoded and stored on a computer, how are they output?

4.1 Windows console output mode

Windows maintains a console output buffer internally. To output a string to the console, you only need to copy the memory region holding the string to the console buffer; Windows then displays the buffer content in the console window using the default character encoding. On the Simplified Chinese version of Windows XP the default character encoding is GBK, so Windows interprets the buffer content as GBK. To display the buffer content correctly in the console window there, you must ensure that the characters copied to the console buffer are GBK-encoded.

4.2 Outputting strings to the console in C/C++

The C printf() function and the C++ std::cout object both call the WriteConsole() function in the system's kernel32.dll, which copies the memory region holding the string to the console buffer.

For a char * string, the output function provided by C is printf(); for a wchar_t * string, it is wprintf().

In VC, char * strings are compiled into ANSI (GBK) codes, which match the encoding of the console output buffer, so they can be output directly. A wchar_t * string, however, is compiled into Unicode codes. If the program copies the memory region holding such a string directly to the output buffer at run time, the string's encoding does not match the console's default encoding, so the console interprets the Unicode codes as GBK codes and the output is garbled.
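To see why such a mismatch produces garbage, it helps to split a byte sequence the way a GBK reader would. The sketch below relies on the general structure of GBK: bytes below 0x80 are single ASCII characters, and a lead byte in 0x81..0xFE followed by a trail byte in 0x40..0xFE (except 0x7F) forms one double-byte character. `split_gbk` and `demo_split_gbk` are hypothetical names, and the function only segments bytes; it does not check whether a pair is an assigned character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split a byte string into the units a GBK reader would treat as characters.
std::vector<std::string> split_gbk(const std::string& bytes) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 1;
        if (b0 >= 0x81 && b0 <= 0xFE && i + 1 < bytes.size()) {
            unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
            if (b1 >= 0x40 && b1 <= 0xFE && b1 != 0x7F) len = 2;
        }
        chars.push_back(bytes.substr(i, len));
        i += len;
    }
    return chars;
}

// Usage: GBK bytes split into two characters, but the little-endian UTF-16
// bytes of the same text (2D 4E 87 65) split into "-", "N", and an unrelated
// double-byte unit.
void demo_split_gbk() {
    assert(split_gbk("\xD6\xD0\xCE\xC4").size() == 2);
    std::vector<std::string> wrong = split_gbk("\x2D\x4E\x87\x65");
    assert(wrong.size() == 3);
    assert(wrong[0] == "-" && wrong[1] == "N" && wrong[2] == "\x87\x65");
}
```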

A feasible approach is to convert the Unicode codes to GBK codes first and then copy the result to the console output buffer, which avoids the garbled output.
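The idea of converting before output can be sketched with a toy mapping. Everything here is hypothetical and for illustration only: `to_gbk_demo` and `demo_to_gbk` are made-up names, and the table holds just the two characters from this article. A real program would use a complete conversion routine instead, such as the Win32 function WideCharToMultiByte() on Windows.

```cpp
#include <cassert>
#include <map>
#include <string>

// Convert UTF-16 code units to GBK bytes using a two-entry demo table.
// ASCII passes through unchanged; anything else unknown becomes '?'.
std::string to_gbk_demo(const std::u16string& text) {
    static const std::map<char16_t, std::string> table = {
        {u'\u4E2D', "\xD6\xD0"},  // values taken from the table in the introduction
        {u'\u6587', "\xCE\xC4"},
    };
    std::string out;
    for (char16_t unit : text) {
        if (unit < 0x80) {
            out += static_cast<char>(unit);
            continue;
        }
        auto it = table.find(unit);
        out += (it != table.end()) ? it->second : std::string("?");
    }
    return out;
}

// Usage: the converted bytes match what a GBK console expects.
void demo_to_gbk() {
    assert(to_gbk_demo(u"\u4E2D\u6587") == "\xD6\xD0\xCE\xC4");
}
```

The result is the same byte sequence that a char * string holds on the system described in this article, so it can be written to the console like any other GBK string.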

Outputting char * and wchar_t * strings in C and C++:

    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    #include <iostream>
    #include <locale>

    // C: output a char * string (ANSI/GBK)
    void cprintchar(const char *str)
    {
        printf("%s\n", str);
    }

    // C: output a wchar_t * string (Unicode)
    void cprintwchar(const wchar_t *wcstr)
    {
        // Tell the program which encoding the console buffer uses; requires <locale.h>
        setlocale(LC_ALL, "ZHI");
        wprintf(L"%ls\n", wcstr);
    }

    // C++: output a char * string (ANSI/GBK)
    void ccprintchar(const char *str)
    {
        std::cout << str << std::endl;
    }

    // C++: output a wchar_t * string (Unicode)
    void ccprintwchar(const wchar_t *wcstr)
    {
        // Tell the stream which encoding the console buffer uses; requires <locale>
        std::wcout.imbue(std::locale("ZHI"));
        std::wcout << wcstr << std::endl;
    }
