The Complete Guide to C++ Strings, Part I: Win32 Character Encodings

Last Update:2017-02-27 Source: Internet

Author: User



Introduction

You have no doubt seen a variety of string types such as TCHAR, std::string, and BSTR, as well as strange macros that start with _tcs, and perhaps stared at the monitor in bewilderment. This guide summarizes the purpose of each character type, demonstrates some simple usages, and shows how to convert between the various string types when necessary.

In the first part, we will describe the three types of character encodings. It is important to understand how the various encoding schemes work. Even if you already know that a string is an array of characters, you should read this part. Once you understand it, the relationships between the various string types become clear.

In the second part, we will cover the string classes themselves: how to use them, and how to convert between them.

Character Basics--ASCII, DBCS, Unicode

All string classes are ultimately based on C-style strings, and a C-style string is an array of characters, so let's introduce the character types first. There are three encoding schemes, corresponding to three types of characters. The first is the single-byte character set (SBCS). In this scheme, every character is represented by exactly one byte. ASCII is an SBCS. A single zero byte marks the end of an SBCS string.

The second scheme is the multi-byte character set (MBCS). An MBCS encoding contains some characters that are one byte long and others that are longer than one byte. The MBCS schemes used in Windows contain two character types: single-byte characters and double-byte characters. Because the multi-byte characters used in Windows are generally two bytes long, the term double-byte character set (DBCS) is often used in place of MBCS.

In a DBCS encoding, certain values are reserved to indicate that they are part of a double-byte character. For example, in Shift-JIS (a common Japanese encoding), values in the ranges 0x81-0x9F and 0xE0-0xFC mean "this is a double-byte character, and the next byte is part of this character." Such values are called "lead bytes", and they are always greater than 0x7F. The byte following a lead byte is called the "trail byte." In a DBCS, a trail byte can be any non-zero value. As in an SBCS, the end of a DBCS string is marked by a single zero byte.
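The lead-byte rule can be sketched in a few lines of C++. This is only an illustration under the Shift-JIS ranges quoted above; the helper names is_sjis_lead_byte and sjis_char_count are hypothetical, and production Windows code would more likely ask the system (for example through the Win32 IsDBCSLeadByte function) than hard-code ranges.

```cpp
#include <cassert>
#include <cstddef>

// True if b falls in one of the Shift-JIS lead-byte ranges quoted in the
// text (0x81-0x9F or 0xE0-0xFC). Hypothetical helper for illustration.
bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters (not bytes) in a zero-terminated Shift-JIS string.
// A lead byte plus the trail byte that follows it counts as one character.
std::size_t sjis_char_count(const char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    while (*s != '\0') {
        const unsigned char b = static_cast<unsigned char>(*s);
        // Skip the trail byte as well, unless the string is cut off
        // immediately after the lead byte.
        if (is_sjis_lead_byte(b) && s[1] != '\0') {
            s += 2;
        } else {
            s += 1;
        }
        ++count;
    }
    return count;
}
```

Called on a six-byte string made of three double-byte characters, sjis_char_count reports three characters, whereas strlen would report six bytes.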

The third encoding scheme is Unicode. As used in Windows and in this guide, Unicode is an encoding in which every character is encoded using two bytes. Unicode characters are sometimes called wide characters because they are wider (use more storage) than single-byte characters. Note that Unicode is not considered an MBCS: the distinguishing feature of an MBCS is that its characters are encoded with varying numbers of bytes. A Unicode string is terminated by a two-byte 0.

Single-byte characters cover the Latin alphabet, accented characters, and the graphic characters defined by the ASCII standard and the DOS operating system. Double-byte characters are used for East Asian and Middle Eastern languages. Unicode is used within COM and the Windows NT operating systems.

You are no doubt already familiar with single-byte characters. When you use char, you are dealing with single-byte characters. Double-byte characters are also manipulated using the char type (one of the many oddities of double-byte characters that we will see). Unicode characters are represented by wchar_t. Unicode character and string constants are written with the prefix L. For example:

wchar_t wch = L'1';              // 2 bytes, 0x0031
const wchar_t* wsz = L"Hello";   // 12 bytes, 6 wide characters
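The byte counts in the example above assume a 2-byte wchar_t, which is what Windows compilers use; other platforms commonly use 4 bytes. The following sketch therefore derives the size from sizeof(wchar_t) instead of hard-coding it. The helper name wide_string_bytes is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Bytes occupied by a zero-terminated wide string, including its terminator.
// With a 2-byte wchar_t (as on Windows), L"Hello" gives (5 + 1) * 2 = 12.
std::size_t wide_string_bytes(const wchar_t* s) {
    assert(s != nullptr);
    return (std::wcslen(s) + 1) * sizeof(wchar_t);
}
```

Note that std::wcslen counts wide characters, not bytes, and does not count the terminator, so L"Hello" has a length of 5 but occupies 6 wide characters of storage.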

How characters are stored in memory

Single-byte strings are stored one character (one byte) after another, ending with a single zero byte. For example, "Bob" is stored as follows:

42	6F	62	00
B	o	b	EOS

The Unicode version, L"Bob", is stored as follows:

42 00	6F 00	62 00	00 00
B	o	b	EOS

A two-byte 0 serves as the terminator.
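The two layouts above can be reproduced with a small hex-dump sketch. It makes two assumptions that are not in the original text: char16_t stands in for the 2-byte wchar_t of Windows, and the bytes are written low byte first, which matches the little-endian layout shown in the table. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Appends one byte to out as two uppercase hex digits, space-separated.
void append_hex(std::string& out, unsigned char b) {
    char buf[4];
    std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(b));
    if (!out.empty()) out += ' ';
    out += buf;
}

// Hex dump of a single-byte string, including the one-byte terminator.
std::string dump_sbcs(const char* s) {
    assert(s != nullptr);
    std::string out;
    for (;; ++s) {
        append_hex(out, static_cast<unsigned char>(*s));
        if (*s == '\0') break;
    }
    return out;
}

// Hex dump of a 16-bit string as laid out on a little-endian machine
// (low byte first), including the two-byte terminator.
std::string dump_utf16le(const char16_t* s) {
    assert(s != nullptr);
    std::string out;
    for (;; ++s) {
        append_hex(out, static_cast<unsigned char>(*s & 0xFF));
        append_hex(out, static_cast<unsigned char>((*s >> 8) & 0xFF));
        if (*s == u'\0') break;
    }
    return out;
}
```

With these helpers, dump_sbcs("Bob") yields "42 6F 62 00" and dump_utf16le(u"Bob") yields "42 00 6F 00 62 00 00 00".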

At first glance, a DBCS string looks like an SBCS string, but as we will see shortly, DBCS strings have subtleties that can produce unexpected results when you use string manipulation functions or traverse the string with a character pointer. The string "日本語" ("nihongo") is stored in memory as follows (LB and TB denote lead byte and trail byte, respectively):

93 FA	96 7B	8C EA	00
LB TB	LB TB	LB TB	EOS

Note that the value of "ni" must not be interpreted as the WORD value 0xFA93. Rather, the two bytes 93 and FA, in that order, encode the character "ni".
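The value 0xFA93 is what appears if the two bytes are loaded as one 16-bit word on a little-endian machine such as x86, where the first byte in memory becomes the low byte. A minimal sketch of that effect, with the hypothetical helper load_word_le doing the combination explicitly so that the result does not depend on the host CPU:

```cpp
#include <cassert>
#include <cstdint>

// Combines two consecutive bytes the way a little-endian CPU does when it
// loads them as one 16-bit word: the first byte in memory is the low byte.
std::uint16_t load_word_le(const unsigned char* p) {
    assert(p != nullptr);
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
```

Given the bytes 93 FA, load_word_le returns 0xFA93, even though the character is defined by the byte sequence 93 then FA. This is why DBCS text should be handled as a sequence of bytes and not as 16-bit words.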

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
