Repost: Mastering Visual Studio: understanding multi-byte encoding and Unicode

Last Update:2016-06-29 Source: Internet

Author: User

Tags coding standards


The previous article, "Mastering Visual Studio: climbing out of the Runtime Library pit", covered the various C/C++ runtime libraries on Windows and how they came about, an area of C++ development where it is particularly easy to go astray. In this article we continue with another pair of concepts that is easily confused in C++ development: the multi-byte character set and the Unicode character set.

Multibyte characters and wide characters: char and wchar_t

C++ has two basic data types for representing characters: char and wchar_t.
char is called a multibyte character type. One char occupies one byte; it is called multibyte because a single character may take one byte or several bytes, depending on the character and the encoding. An English character (such as 's') is represented by one char (one byte), while a Chinese character (such as '中') needs several: three char (three bytes) in UTF-8, as in the following example, or two in GBK.

#include <iostream>
using namespace std;

void TestChar()
{
     char ch1 = 's'; // correct
     cout << "ch1:" << ch1 << endl;
     char ch2 = '中'; // wrong: one char cannot hold a whole Chinese character, so the value is truncated
     cout << "ch2:" << ch2 << endl;

     char str[4] = "中"; // In UTF-8 the first three bytes store '中' and the last byte stores the string terminator \0
     cout << "str:" << str << endl;
     // char str2[2] = "国"; // Error: 'str2': array bounds overflow
     // cout << str2 << endl;
}

The output is as follows:

ch1:s
ch2:
str:中
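To make the byte counts concrete, here is a minimal sketch that spells out the bytes of '中' with escape sequences, so the result does not depend on how the source file is saved. The helper name ByteLength and the constant names are hypothetical; the GBK bytes are the ones quoted later in this article.

```cpp
#include <cstring>

// Hypothetical helper: number of char units (bytes) before the terminator.
std::size_t ByteLength(const char* s)
{
    return std::strlen(s);
}

// '中' written as explicit bytes, so the source-file encoding does not matter.
const char kZhongUtf8[] = "\xE4\xB8\xAD"; // UTF-8: three bytes
const char kZhongGbk[]  = "\xD6\xD0";     // GBK: two bytes
const char kLatinS[]    = "s";            // ASCII: one byte
```

Calling ByteLength on each constant shows that the same character costs a different number of char units in each encoding, which is why the size of a char array has to be chosen with the encoding in mind.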

wchar_t is called a wide character type. With Visual C++ on Windows one wchar_t occupies 2 bytes (GCC on Linux typically uses 4). It is called wide because every character in common use, whether English or Chinese, is represented by a single wchar_t. Look at the following example:

#include <locale>

void TestWchar_t()
{
     wcout.imbue(locale("chs")); // Set the locale of wcout to Chinese ("chs" is a Windows locale name)

     wchar_t wch1 = L's'; // correct
     wcout << "wch1:" << wch1 << endl;
     wchar_t wch2 = L'中'; // correct: one Chinese character is represented by one wchar_t
     wcout << "wch2:" << wch2 << endl;

     wchar_t wstr[2] = L"中"; // The first wchar_t stores '中'; the second wchar_t stores the string terminator \0
     wcout << "wstr:" << wstr << endl;
     wchar_t wstr2[3] = L"中国";
     wcout << "wstr2:" << wstr2 << endl;
}

The results are as follows:

wch1:s
wch2:中
wstr:中
wstr2:中国

Notes:
1. When initializing a wchar_t variable with a character constant, prefix the constant with L, for example: wchar_t wch2 = L'中';
2. When initializing a wchar_t array with a string constant, prefix the string with L, for example: wchar_t wstr2[3] = L"中国";
3. Without the L prefix, English characters may still work, but non-English text (such as Chinese) will go wrong.
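The notes above can be checked with a small sketch. It writes "中国" with universal character names so the source encoding does not matter; the names kChina, WideLength and WideStorageBytes are hypothetical. The storage size depends on sizeof(wchar_t), which is 2 with Visual C++ and typically 4 with GCC or Clang on Linux.

```cpp
#include <cwchar>

// "中国" written with universal character names (U+4E2D, U+56FD).
const wchar_t kChina[] = L"\u4e2d\u56fd";

// Number of wchar_t units before the terminator.
std::size_t WideLength(const wchar_t* s)
{
    return std::wcslen(s);
}

// Storage in bytes, including the terminator; depends on sizeof(wchar_t).
std::size_t WideStorageBytes()
{
    return sizeof(kChina);
}
```

Both characters take one wchar_t each, so the array needs three elements: two characters plus the terminator, which matches wstr2[3] in the example.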

string and wstring

A character array can represent a string, but its length is fixed and must be known before the array is used. To make strings easier to manipulate, the STL provides string and wstring. Most of us are familiar with string, but wstring is probably used less often.

string is the ordinary multibyte version; it is based on char and wraps a char array.

wstring is the wide-character (Unicode) version; it is based on wchar_t and wraps a wchar_t array.

Converting between string and wstring:

The following two functions are cross-platform: they can be used on both Windows and Linux.

 
 
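#include <clocale>
#include <cstdlib>
#include <string>

// wstring => string (multibyte, in the encoding of the user's default locale)
std::string WString2String(const std::wstring& ws)
{
    // Remember the current locale, then switch to the user's default locale.
    const std::string strOldLocale = setlocale(LC_ALL, NULL);
    setlocale(LC_ALL, "");
    std::string strResult;
    // With a NULL destination, wcstombs only reports the number of bytes needed.
    const size_t nDestSize = wcstombs(NULL, ws.c_str(), 0);
    if (nDestSize != static_cast<size_t>(-1)) // (size_t)-1: a character could not be converted
    {
        strResult.resize(nDestSize);
        if (nDestSize > 0)
            wcstombs(&strResult[0], ws.c_str(), nDestSize);
    }
    setlocale(LC_ALL, strOldLocale.c_str()); // restore the previous locale
    return strResult;
}

// string => wstring (the input is interpreted in the user's default locale)
std::wstring String2WString(const std::string& s)
{
    const std::string strOldLocale = setlocale(LC_ALL, NULL);
    setlocale(LC_ALL, "");
    std::wstring wstrResult;
    // With a NULL destination, mbstowcs only reports the number of wchar_t needed.
    const size_t nDestSize = mbstowcs(NULL, s.c_str(), 0);
    if (nDestSize != static_cast<size_t>(-1)) // (size_t)-1: an invalid multibyte sequence
    {
        wstrResult.resize(nDestSize);
        if (nDestSize > 0)
            mbstowcs(&wstrResult[0], s.c_str(), nDestSize);
    }
    setlocale(LC_ALL, strOldLocale.c_str()); // restore the previous locale
    return wstrResult;
}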

Character set and character encoding

Character set (charset): the collection of all abstract characters that a system supports. Characters are the letters and symbols of writing, including the scripts of every country, punctuation marks, graphic symbols, digits, and so on. Common character sets include ASCII, GB2312 (mainly for Chinese characters), GBK (mainly for Chinese characters), and Unicode.

Character encoding: a set of rules that pairs the characters of a character set (such as an alphabet or a syllabary) with binary numbers that a computer can process. In other words, it establishes a correspondence between a set of symbols and a number system, and it is a basic technique of information processing. People usually express information with sets of symbols (usually text), whereas a computer stores and processes information as binary numbers. Character encoding converts the symbols into binary codes the computer can recognize.

Often a character set has exactly one encoding, as in the ANSI family (ANSI encodings extend ASCII to support more languages, typically using two bytes in the range 0x80~0xFF to represent one character); examples are ASCII, ISO 8859-1, GB2312, and GBK. In that case the name of the encoding and the name of the character set are used interchangeably.
A character set can also have several encodings. For example, the UCS character set (which is also the character set used by Unicode) has the encodings UTF-8, UTF-16, UTF-32, and so on.
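As an illustration of one character set with several encodings, the following sketch counts the code units that the same code point, U+4E2D ('中'), occupies in UTF-8, UTF-16 and UTF-32. It assumes a C++11 or later compiler, because it relies on the u8, u and U string literal prefixes; the constant names are hypothetical.

```cpp
#include <cstddef>

// The same code point, U+4E2D, in three Unicode encodings.
// sizeof includes the terminator, so one code unit is subtracted.
const std::size_t kUtf8Units  = sizeof(u8"\u4e2d") - 1;                   // 1-byte code units
const std::size_t kUtf16Units = sizeof(u"\u4e2d") / sizeof(char16_t) - 1; // 2-byte code units
const std::size_t kUtf32Units = sizeof(U"\u4e2d") / sizeof(char32_t) - 1; // 4-byte code units
```

The character is the same in all three cases; only the number and width of the code units differ, which is exactly the distinction between a character set and an encoding.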

Historically, computer character encoding went through roughly three stages:
Stage one: the ASCII character set and ASCII encoding.
At first computers supported only English (that is, Latin letters); other languages could not be stored or displayed. ASCII uses 7 bits of one byte to represent a character, with the highest bit set to 0. Later, to cover more symbols commonly used in European languages, ASCII was extended into EASCII, which uses all 8 bits to represent a character. It can therefore represent 128 more characters and supports some Western European characters.

Stage two: ANSI encodings (localization).
To support more languages, computers typically use two bytes in the range 0x80~0xFF to represent one character. For example, on a Chinese operating system the character '中' is stored as the two bytes [0xD6, 0xD0].
Different countries and regions developed their own standards, which gave rise to GB2312, BIG5, JIS, and other encoding standards. These extended encodings, which use two bytes per character, are collectively called ANSI encodings. On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese system, ANSI encoding means JIS.
The different ANSI encodings are mutually incompatible. When information is exchanged internationally, text in two different languages cannot be stored in the same piece of ANSI-encoded text.

Stage three: Unicode (internationalization).
To make international exchange of information easier, international organizations developed the Unicode character set, which assigns a single, unique number to every character of every language, meeting the needs of cross-language, cross-platform text conversion and processing. Unicode has three common encodings: UTF-8 (1 to 4 bytes per character), UTF-16 (2 or 4 bytes per character), and UTF-32 (4 bytes per character).
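To show why UTF-8 is variable-length, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name EncodeUtf8 is hypothetical and the code is an illustration of the standard bit layout, not a replacement for a tested library.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for values that are not valid code points.
std::string EncodeUtf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                 // 7 bits: one byte, same as ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 11 bits: two bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 16 bits: three bytes
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;              // surrogate values are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {     // 21 bits: four bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, EncodeUtf8(0x4E2D) yields the three bytes 0xE4 0xB8 0xAD for '中', while EncodeUtf8('s') yields the single ASCII byte.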

We can use a tree diagram to show the character sets and encodings that branched out from ASCII:

Figure 1: The various encodings

If you want to learn more about character sets and character encodings, refer to:
Character Set and character encoding (Charset & Encoding)

Configuring multi-byte and wide characters in a project

Right-click the project name -> Properties and set the following:

Figure 2: Character Set

When Use Unicode Character Set is selected, these preprocessor macros are defined: _UNICODE and UNICODE

Figure 3: Unicode
When Use Multi-Byte Character Set is selected, this preprocessor macro is defined: _MBCS

Figure 4: Multi-Byte
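One way to see which setting a build actually uses is to test these macros in code. The following is a minimal sketch; the function name ConfiguredCharacterSet is hypothetical, and outside Visual Studio normally none of the macros is defined.

```cpp
// Report which character-set macro the current build defines.
const char* ConfiguredCharacterSet()
{
#if defined(_UNICODE) || defined(UNICODE)
    return "Unicode";     // Use Unicode Character Set
#elif defined(_MBCS)
    return "Multi-Byte";  // Use Multi-Byte Character Set
#else
    return "Not Set";     // neither macro is defined
#endif
}
```

Because the choice is made by the preprocessor, switching the project setting changes the result only after a rebuild.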

What is the difference between the Unicode Character Set and the Multi-Byte Character Set (MBCS)?

Let's look at an example:
A program needs to pop up a message box with MessageBox:

#include "windows.h"

void TestMessageBox()
{
     ::MessageBox(NULL, "This is a test program!", "Title", MB_OK);
}

The demo above is simple enough to need no explanation. When Character Set is Use Multi-Byte Character Set, it compiles and runs normally. But when we switch to Use Unicode Character Set, we get the following compilation error:

error C2664: 'MessageBoxW' : cannot convert parameter 2 from 'const char [+]' to 'LPCWSTR'

This is because MessageBox comes in two versions: MessageBoxW for Unicode and MessageBoxA for Multi-Byte. A macro selects between them depending on which predefined macros are set. Because we chose Use Unicode Character Set, _UNICODE and UNICODE are defined, so the compiler uses MessageBoxW. Passing a multibyte string constant then fails; a wide-character string is required. Changing "Title" to L"Title", and likewise "This is a test program!" to L"This is a test program!", fixes the error.

WINUSERAPI
int
WINAPI
MessageBoxA(
    __in_opt HWND hWnd,
    __in_opt LPCSTR lpText,
    __in_opt LPCSTR lpCaption,
    __in UINT uType);
WINUSERAPI
int
WINAPI
MessageBoxW(
    __in_opt HWND hWnd,
    __in_opt LPCWSTR lpText,
    __in_opt LPCWSTR lpCaption,
    __in UINT uType);
#ifdef UNICODE
#define MessageBox  MessageBoxW
#else
#define MessageBox  MessageBoxA
#endif // !UNICODE
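The same selection pattern can be imitated without the Windows headers. The sketch below is not the Windows API: ShowTextA, ShowTextW and ShowText are hypothetical stand-ins for MessageBoxA, MessageBoxW and MessageBox, and each simply returns the number of code units it received.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for an A/W function pair.
std::size_t ShowTextA(const char* text)    { return std::string(text).size(); }
std::size_t ShowTextW(const wchar_t* text) { return std::wstring(text).size(); }

// The generic name is replaced by one of the two at preprocessing time.
#ifdef UNICODE
#define ShowText ShowTextW
#else
#define ShowText ShowTextA
#endif
```

With UNICODE defined, a call such as ShowText("Title") expands to ShowTextW("Title") and fails to compile, because a const char* cannot be converted to const wchar_t*. This is the same kind of mismatch that error C2664 reports.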

The Multi-Byte Character Set mentioned above generally means an ANSI (multibyte) character set; for ANSI, see the section on character sets and character encodings above. The Unicode Character Set is the Unicode character set, which here generally means Unicode encoded as UTF-16. Most characters are encoded in two bytes; two bytes can represent 65,536 values, which covers most of the world's languages, and the remaining characters are encoded as a pair of two-byte units (four bytes).

Unicode is generally recommended, because it accommodates the languages of every country, which is very useful in internationalized software. Use the multi-byte setting only when storage requirements are very tight or when you need compatibility with C code.

Understanding the _T() and _TEXT() macros and the L"" prefix

In the MessageBox call in the previous section, besides L"Title" you can also write _T("Title") or _TEXT("Title"). You will find that _T and _TEXT are used even more often in MFC and Win32 programs. So what is the difference between _T, _TEXT, and L?

From the first section, on multibyte and wide characters, we know that ordinary double quotes denote a multibyte (char) string constant, such as "String test", while an L before the quotes denotes a wide-character (wchar_t) string constant, such as L"String test".

Looking at the tchar.h header, we see that _T and _TEXT do the same thing; both are predefined macros.

#define _T(x)       __T(x)
#define _TEXT(x)    __T(x)

Now look at the definition of __T(x); there are two of them:

#ifdef _UNICODE
// ... other code omitted
#define __T(x) L ## x
// ... other code omitted
#else /* ndef _UNICODE */
// ... other code omitted
#define __T(x) x
// ... other code omitted
#endif /* _UNICODE */

Is that clear now? When the project's Character Set is Use Unicode Character Set, _T and _TEXT prefix the string constant with L; otherwise (that is, with Use Multi-Byte Character Set) the string is treated as an ordinary string.
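The mechanism can be reproduced in a few lines. This is a hypothetical re-creation of the tchar.h pattern, not the header itself: SKETCH_T stands in for _T, SketchChar for TCHAR, and SketchLen for a generic length function.

```cpp
#include <cstring>
#include <cwchar>

#ifdef _UNICODE
typedef wchar_t SketchChar;
#define SKETCH_T(x) L##x   // paste an L in front of the literal
inline std::size_t SketchLen(const SketchChar* s) { return std::wcslen(s); }
#else
typedef char SketchChar;
#define SKETCH_T(x) x      // leave the literal unchanged
inline std::size_t SketchLen(const SketchChar* s) { return std::strlen(s); }
#endif

// The same source line compiles under either setting.
const SketchChar* const kTitle = SKETCH_T("Title");
```

Code written against SketchChar and SKETCH_T never mentions char or wchar_t directly, which is why _T-style code can be rebuilt for either character set without edits.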

DWORD, LPSTR, LPWSTR, LPCSTR, LPCWSTR, LPTSTR, LPCTSTR

Visual C++ has some common type names that can be confusing, such as DWORD, LPSTR, LPWSTR, LPCSTR, LPCWSTR, LPTSTR, and LPCTSTR. Here is a summary:
Common types:

Type	MBCS	UNICODE
WCHAR	wchar_t	wchar_t
LPSTR	char*	char*
LPCSTR	const char*	const char*
LPWSTR	wchar_t*	wchar_t*
LPCWSTR	const wchar_t*	const wchar_t*
TCHAR	char	wchar_t
LPTSTR	TCHAR* (that is, char*)	TCHAR* (that is, wchar_t*)
LPCTSTR	const TCHAR*	const TCHAR*

Conversion macros (ATL/MFC, declared in atlconv.h):
LPWSTR->LPTSTR: W2T();
LPTSTR->LPWSTR: T2W();
LPCWSTR->LPCTSTR: W2CT();
LPCTSTR->LPCWSTR: T2CW();

ANSI->UNICODE: A2W();
UNICODE->ANSI: W2A();

String functions:
The string manipulation functions also come in one-to-one pairs:

MBCS	UNICODE
strlen();	wcslen();
strcpy();	wcscpy();
strcmp();	wcscmp();
strcat();	wcscat();
strchr();	wcschr();
...	...

From the names of these functions and types you may have spotted a pattern: names with a w prefix (or a W suffix) are generally for wide characters, and names without the w prefix (or with an A suffix) are generally for multibyte characters.
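The pairing can be made visible with overloads that forward to the narrow or the wide C function according to the argument type. This is only a sketch; the names Length and Compare are hypothetical and not part of any library.

```cpp
#include <cstring>
#include <cwchar>

// Forward to strlen or wcslen depending on the character type.
inline std::size_t Length(const char* s)    { return std::strlen(s); }
inline std::size_t Length(const wchar_t* s) { return std::wcslen(s); }

// Forward to strcmp or wcscmp depending on the character type.
inline int Compare(const char* a, const char* b)       { return std::strcmp(a, b); }
inline int Compare(const wchar_t* a, const wchar_t* b) { return std::wcscmp(a, b); }
```

Both versions of each function behave the same way; they differ only in the character type they accept, which is the one-to-one relationship the table describes.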

Understanding why CString exists and how it works

CString is a dynamic TCHAR array, that is, a class wrapping a TCHAR array. It is a fully self-contained class that provides operators such as "+" and string-manipulation methods; in other words, CString is a collection of operations on TCHAR strings. Its purpose is to make string handling and type conversion easier in Win32 and MFC programs.

For more detailed usage of CString, please refer to:
The difference and conversion between CString and string, char*
Common uses of CString

Reference articles:
Character Set and character encoding (Charset & Encoding)
Characters, bytes, and encodings
"Windows Core Programming" series, part 2: ANSI and Unicode character sets
DWORD, LPSTR, LPWSTR, LPCSTR, LPCWSTR, LPTSTR, LPCTSTR

Transferred from: http://blog.csdn.net/luoweifu/article/details/49382969
