The Complete Guide to C++ Strings: Win32 Character Encodings

Source: Internet
Author: User
Introduction

You have undoubtedly seen various string types such as TCHAR, std::string, and BSTR, along with strange macros starting with _tcs, and perhaps stared at the monitor in confusion. This guide summarizes the purpose of each character type, demonstrates some simple usage, and shows how to convert between string types when necessary.
The first part introduces the three character encoding types. It is important to understand how the encodings work. Even if you already know that a string is an array of characters, you should read this part; once you understand it, the relationships between the various string classes become clear.
The second part describes the string classes themselves, how to use them, and how to convert between them.

Character basics: ASCII, DBCS, Unicode

All string classes are ultimately based on C-style strings, and a C-style string is an array of characters, so we start with character types. There are three encoding schemes and, correspondingly, three character types. The first is the single-byte character set (SBCS), in which every character is exactly one byte long. ASCII is an SBCS. A single zero byte marks the end of an SBCS string.
The second scheme is the multi-byte character set (MBCS). In an MBCS encoding, some characters are one byte long while others are longer. The MBCS encodings used in Windows contain two kinds of characters: single-byte characters and double-byte characters. Because the largest multi-byte character used in Windows is two bytes long, the term double-byte character set (DBCS) is commonly used in place of MBCS.
In a DBCS encoding, certain values are reserved to indicate that they are part of a double-byte character. For example, in Shift-JIS (a common Japanese encoding), values in the ranges 0x81-0x9F and 0xE0-0xFC mean "this is a double-byte character, and the next byte is part of it." Such values are called lead bytes, and they are always greater than 0x7F. The byte that follows a lead byte is called the trail byte. In DBCS, the trail byte can be any non-zero value. As with SBCS, the end of a DBCS string is marked by a single zero byte.
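To make the lead-byte idea concrete, here is a minimal sketch in portable C++ that labels each byte of a Shift-JIS string. The names isShiftJisLeadByte, classifyBytes, and kNihongoBytes are hypothetical and exist only for this illustration; the ranges are the ones quoted above, and a real program on Windows would normally ask the system (for example with IsDBCSLeadByte()) rather than hard-code them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Shift-JIS lead-byte ranges as quoted in the text: 0x81-0x9F and 0xE0-0xFC.
bool isShiftJisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Sample data (hypothetical name): three double-byte characters in Shift-JIS.
const char kNihongoBytes[] = "\x93\xfa\x96\x7b\x8c\xea";

// Labels every byte of a Shift-JIS string:
// 'S' = single-byte character, 'L' = lead byte, 'T' = trail byte.
std::string classifyBytes(const char* s) {
    assert(s != nullptr);
    std::string labels;
    for (std::size_t i = 0; s[i] != '\0'; ++i) {
        bool lead = isShiftJisLeadByte(static_cast<unsigned char>(s[i]));
        if (lead && s[i + 1] != '\0') {
            labels += "LT";
            ++i;  // the trail byte belongs to the same character
        } else {
            labels += 'S';
        }
    }
    return labels;
}
```

Calling classifyBytes(kNihongoBytes) yields one L/T pair per character, while a plain ASCII string such as "Bob" yields only S labels.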
The third scheme is Unicode. As used in Windows, Unicode is an encoding in which every character is two bytes long. Unicode characters are sometimes called wide characters because they are wider (take more storage) than single-byte characters. Note that Unicode is not considered an MBCS; the distinguishing feature of an MBCS is that its characters have differing lengths. A Unicode string is terminated by two zero bytes (the value 0 encoded as a wide character).
Single-byte characters cover the Latin alphabet, accented characters, and the graphics characters defined by the ASCII standard and the DOS operating system. Double-byte characters are used in East Asian and Middle Eastern languages. Unicode is used in COM and internally in Windows NT.
You are certainly already familiar with single-byte characters: when you use char, you are dealing with them. Double-byte characters are also manipulated through the char type (one of the many oddities of double-byte characters that we will see). Unicode characters are represented by wchar_t. Unicode character and string literals are written with the prefix L. For example:

wchar_t wch = L'1';             // 2 bytes, 0x0031
const wchar_t* wsz = L"Hello";  // 12 bytes, 6 wide characters

How are characters stored in memory?

Single-byte strings: each character occupies one byte, the characters are stored in sequence, and the string ends with a single zero byte. For example, "Bob" is stored as:

42 6f 62 00
B  o  b  EOS

The Unicode version, L"Bob", is stored as:

42 00 6f 00 62 00 00 00
B     o     b     EOS

The end mark is the value 0 encoded in two bytes.
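The two layouts can be reproduced with a short, portable sketch. It assumes char16_t as a stand-in for the two-byte wchar_t of Windows (wchar_t is wider on some other platforms) and writes the low byte first to mimic a little-endian machine. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Appends one byte to `out` as two hex digits plus a separating space.
static void appendHexByte(std::string& out, unsigned value) {
    char buf[4];
    std::snprintf(buf, sizeof buf, "%02x ", value & 0xFFu);
    out += buf;
}

// Hex dump of a narrow string, including its one-byte terminator.
std::string dumpNarrow(const char* s) {
    assert(s != nullptr);
    std::string out;
    const char* p = s;
    do {
        appendHexByte(out, static_cast<unsigned char>(*p));
    } while (*p++ != '\0');
    out.pop_back();  // drop the trailing space
    return out;
}

// Hex dump of a string of two-byte units in little-endian order,
// including its two-byte terminator.
std::string dumpUtf16le(const char16_t* s) {
    assert(s != nullptr);
    std::string out;
    const char16_t* p = s;
    do {
        appendHexByte(out, *p);       // low byte first
        appendHexByte(out, *p >> 8);  // then the high byte
    } while (*p++ != u'\0');
    out.pop_back();
    return out;
}
```

dumpNarrow("Bob") and dumpUtf16le(u"Bob") produce the two byte sequences shown in the tables above.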

At first glance a DBCS string looks very similar to an SBCS string, but we will soon see subtle differences that cause unexpected results when you use string functions or walk through a string with a character pointer. The string "日本語" ("nihongo") is stored in memory as follows (LB and TB stand for lead byte and trail byte):

93 fa 96 7b 8c ea 00
LB TB LB TB LB TB EOS

Note that the value of "日" ("ni") must not be interpreted as the word value 0xfa93. It is the two bytes 93 and fa, in that order, that encode "日".

Using string-handling functions

We have all seen the C string functions such as strcpy(), sprintf(), and atol(). These functions should be used only with single-byte strings. The C runtime library also provides versions that work only on Unicode strings, such as wcscpy(), swprintf(), and _wtol().
Microsoft also added DBCS-aware versions to its CRT (C runtime library). Each str***() function has a corresponding DBCS version named _mbs***(). If you expect to encounter DBCS strings (and you will if your software is installed in countries that use DBCS encodings, such as China and Japan), you should use the _mbs***() functions, because they also handle SBCS strings. (A DBCS string may contain single-byte characters too, which is why the _mbs***() functions work on SBCS strings as well.)
Let's look at a typical string to see why different versions of the string functions are needed. We return to the Unicode string L"Bob":

42 00 6f 00 62 00 00 00
B     o     b     EOS

Because x86 CPUs are little-endian, the value 0x0042 is stored in memory as 42 00. Can you see what happens if this string is passed to strlen()? It sees the first byte, 42, and then 00, which it takes as the end of the string, so strlen() returns 1. The reverse case, passing the narrow string "Bob" to wcslen(), is worse. wcslen() first sees 0x6f42, then 0x0062, and then keeps reading past the end of your buffer until it happens to find a 00 00 pair or causes a GPF.
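The strlen() half of this mismatch is easy to demonstrate safely. The sketch below stores the bytes of L"Bob" exactly as listed above and measures them two ways. wideLength() is a hypothetical stand-in for wcslen() on a platform with two-byte wchar_t, written by hand so the example does not depend on the platform's wchar_t size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The bytes of L"Bob" as laid out on a little-endian machine with a
// two-byte wchar_t, including the two-byte terminator.
const char kWideBobBytes[] = { 0x42, 0x00, 0x6f, 0x00, 0x62, 0x00, 0x00, 0x00 };

// Byte-oriented length: stops at the first zero byte.
std::size_t narrowLength(const char* bytes) {
    assert(bytes != nullptr);
    return std::strlen(bytes);
}

// Length in two-byte units: stops only when both bytes of a unit are zero.
std::size_t wideLength(const char* bytes) {
    assert(bytes != nullptr);
    std::size_t n = 0;
    while (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0) {
        ++n;
    }
    return n;
}
```

narrowLength(kWideBobBytes) stops after the first byte, while wideLength(kWideBobBytes) counts all three characters.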
So far we have covered str***() versus wcs***(). What is the difference between str***() and _mbs***()? Understanding it is essential to traversing DBCS strings correctly. We will first look at string traversal and then return to the str***() versus _mbs***() question.

Traversing and indexing strings correctly

Most of us grew up with SBCS strings, so we are used to walking through strings with the ++ and -- pointer operators. We also use array notation to access individual characters. Both techniques work for SBCS and Unicode strings because every character has the same width, so the compiler always yields the character we want.
With DBCS strings, however, we must break these habits. There are two rules for traversing a DBCS string with a pointer. Break either one and your program will have DBCS-related bugs.

  1. Do not traverse forward with ++ unless you check for a lead byte every time.
  2. Never traverse backward with --.

    Let's explain rule 2 first, because it is easy to find real code that violates it. Suppose you have a program that stores a configuration file in its own directory, and you keep the installation directory in the registry. At run time, you read the installation directory from the registry, build the configuration file name, and read the file. If the installation directory is C:\Program Files\MyCoolApp, the file name you build should be C:\Program Files\MyCoolApp\config.bin. When you test the program, everything works.
    Now imagine that the code that builds the file name looks like this:

    bool GetConfigFileName ( char* pszName, size_t nBuffSize )
    {
        char szConfigFilename[MAX_PATH];

        // Read install dir from registry... we'll assume it succeeds.

        // Add on a backslash if it wasn't present in the registry value.
        // First, get a pointer to the terminating zero.
        char* pLastChar = strchr ( szConfigFilename, '\0' );

        // Now move it back one character.
        pLastChar--;

        if ( *pLastChar != '\\' )
            strcat ( szConfigFilename, "\\" );

        // Add on the name of the config file.
        strcat ( szConfigFilename, "config.bin" );

        // If the caller's buffer is big enough, return the filename.
        if ( strlen ( szConfigFilename ) >= nBuffSize )
            return false;
        else
            {
            strcpy ( pszName, szConfigFilename );
            return true;
            }
    }

    This code is written very defensively, yet it breaks on certain DBCS characters. Let's see why. Suppose a Japanese user installs your program in C:\ヨウコソ. That name is stored in memory as:

    43 3A 5C 83 88 83 45 83 52 83 5C 00
    C  :  \  LB TB LB TB LB TB LB TB EOS

    When GetConfigFileName() checks for a trailing backslash, it looks at the last non-zero byte of the installation directory, sees that it equals '\\', and so does not append another one. The result is that the code returns the wrong file name.
    What went wrong? Look at the two 5C bytes above. The backslash character '\\' has the value 0x5C. The character 'ソ' is encoded as 83 5C. The code above mistakenly read a trail byte and treated it as a character of its own.
    The correct way to traverse backward is to use a function that is aware of DBCS characters and moves the pointer by the correct number of bytes. Here is the corrected code:

    bool FixedGetConfigFileName ( char* pszName, size_t nBuffSize )
    {
        char szConfigFilename[MAX_PATH];

        // Read install dir from registry... we'll assume it succeeds.

        // Add on a backslash if it wasn't present in the registry value.
        // First, get a pointer to the terminating zero.
        // (The _mbs***() functions take unsigned char*, hence the casts.)
        char* pLastChar = (char*) _mbschr ( (unsigned char*) szConfigFilename, '\0' );

        // Now move it back one character, which may be a double-byte character.
        pLastChar = CharPrev ( szConfigFilename, pLastChar );

        if ( *pLastChar != '\\' )
            _mbscat ( (unsigned char*) szConfigFilename, (const unsigned char*) "\\" );

        // Add on the name of the config file.
        _mbscat ( (unsigned char*) szConfigFilename, (const unsigned char*) "config.bin" );

        // If the caller's buffer is big enough, return the filename.
        // The buffer size is in bytes, so measure the string in bytes too.
        if ( strlen ( szConfigFilename ) >= nBuffSize )
            return false;
        else
            {
            _mbscpy ( (unsigned char*) pszName, (const unsigned char*) szConfigFilename );
            return true;
            }
    }

    The function above uses the CharPrev() API to move pLastChar back one character, which may be two bytes long if the string ends with a double-byte character. In this version the if condition works correctly, because a lead byte is never equal to 0x5C.
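For readers without a Windows compiler at hand, the idea behind moving back one character can be sketched in portable C++. prevCharSjis() below is a hypothetical stand-in for CharPrev() that hard-codes the Shift-JIS lead-byte ranges given earlier and finds the previous character by walking forward from the start of the string. It is an illustration of the technique under those assumptions, not a description of how CharPrev() itself is implemented. kYoukosoPath holds the bytes of the path from the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Shift-JIS lead-byte ranges quoted earlier in the text.
bool isShiftJisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// The example path, "C:\" followed by four double-byte characters.
const char kYoukosoPath[] = "C:\\\x83\x88\x83\x45\x83\x52\x83\x5c";

// Hypothetical stand-in for CharPrev(): returns the start of the character
// that precedes `current`, found by walking forward from `start`.
// If `current` equals `start`, it returns `start`.
const char* prevCharSjis(const char* start, const char* current) {
    assert(start != nullptr && current != nullptr && start <= current);
    const char* prev = start;
    for (const char* p = start; p < current; ) {
        prev = p;
        bool twoBytes =
            isShiftJisLeadByte(static_cast<unsigned char>(*p)) && p[1] != '\0';
        p += twoBytes ? 2 : 1;
    }
    return prev;
}

// Rule-2 violation: looks at the last byte, which may be a trail byte.
bool endsWithBackslashNaive(const char* path) {
    std::size_t len = std::strlen(path);
    return len > 0 && path[len - 1] == '\\';
}

// DBCS-aware check: looks at the first byte of the last character.
bool endsWithBackslashDbcs(const char* path) {
    const char* end = path + std::strlen(path);
    return end != path && *prevCharSjis(path, end) == '\\';
}
```

For kYoukosoPath the naive check reports a trailing backslash and the DBCS-aware check does not. For a plain ASCII path such as "C:\\dir\\" the two agree.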
    Now imagine a case that violates rule 1. For example, you might check whether a file name entered by the user contains more than one ':'. If you walk the string with ++ instead of CharNext(), you may raise a false error whenever a trail byte happens to have the same value as ':'.
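Rule 1 can be sketched the same way. The example below counts occurrences of a character in a Shift-JIS string twice: once stepping byte by byte with ++, and once skipping the trail byte after every lead byte, which is what stepping with CharNext() achieves. It counts backslashes rather than colons, because 0x5C is a value that really does occur as a Shift-JIS trail byte. All function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Shift-JIS lead-byte ranges quoted earlier in the text.
bool isShiftJisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// "C:\" followed by four double-byte characters; the last trail byte is 0x5C.
const char kYoukosoPath[] = "C:\\\x83\x88\x83\x45\x83\x52\x83\x5c";

// Rule-1 violation: plain ++ treats every byte as a character.
std::size_t countNaive(const char* s, char target) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (const char* p = s; *p != '\0'; ++p) {
        if (*p == target) ++count;
    }
    return count;
}

// DBCS-aware traversal: a lead byte and its trail byte form one character,
// so the pair is skipped as a unit and never compared with `target`.
std::size_t countDbcs(const char* s, char target) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (const char* p = s; *p != '\0'; ) {
        if (isShiftJisLeadByte(static_cast<unsigned char>(*p)) && p[1] != '\0') {
            p += 2;
        } else {
            if (*p == target) ++count;
            ++p;
        }
    }
    return count;
}
```

On kYoukosoPath the naive loop finds two backslashes, one of which is really the second half of a double-byte character; the DBCS-aware loop finds only the real one.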
    A rule related to rule 2 concerns string indexes:

    2a. Never use subtraction to obtain a string index.

    The code that violates this rule is similar to the code that violates Rule 2. For example,

    char* pLastChar = &szConfigFilename [strlen(szConfigFilename) - 1];

    This has the same effect as moving a pointer backward.

    Back to str***() versus _mbs***()

    It should now be clear why the _mbs***() functions are needed. The str***() functions know nothing about DBCS characters, whereas the _mbs***() functions do. If you call strrchr("C:\\ヨウコソ", '\\'), the returned pointer may be wrong, whereas _mbsrchr() recognizes the final double-byte character and returns a pointer to the real '\\'.
    One last point about the string functions concerns units of length. The str***() functions measure strings in chars, that is, in bytes, so strlen() returns 6 for a string of three double-byte characters. Note that _mbslen() counts characters rather than bytes and returns 3 for the same string, so buffer sizes must still be computed in bytes. The Unicode functions measure lengths in wchar_t units. For example, wcslen(L"Bob") returns 3.
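The different units can be checked with a few lines of portable C++. sjisCharCount() is a hypothetical helper that counts characters the way a DBCS-aware length function would, using the Shift-JIS lead-byte ranges from earlier; it is a sketch, not the CRT implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Three double-byte Shift-JIS characters, six bytes in total.
const char kNihongoBytes[] = "\x93\xfa\x96\x7b\x8c\xea";

// Counts characters, not bytes, in a Shift-JIS string.
std::size_t sjisCharCount(const char* s) {
    assert(s != nullptr);
    std::size_t chars = 0;
    for (const char* p = s; *p != '\0'; ++chars) {
        unsigned char b = static_cast<unsigned char>(*p);
        bool lead = (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
        p += (lead && p[1] != '\0') ? 2 : 1;
    }
    return chars;
}
```

std::strlen(kNihongoBytes) reports the byte count, sjisCharCount(kNihongoBytes) the character count, and std::wcslen(L"Bob") the number of wchar_t units.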

    MBCS and Unicode in Win32 APIs

    Two sets of APIs

    Although you may never have noticed, every Win32 API and message that deals with strings comes in two versions: one accepts MBCS strings and the other accepts Unicode strings. For example, there is no API actually named SetWindowText(). Instead there are SetWindowTextA() and SetWindowTextW(). The A suffix (for ANSI) marks the MBCS function, and the W suffix (for wide) marks the Unicode function.
    When you build a Windows program, you choose between the MBCS and the Unicode APIs. If you used the Visual C++ AppWizard and never touched the preprocessor settings, you have been using the MBCS versions. So if there is no SetWindowText() API, how can we call it? The winuser.h header contains macros such as:

    BOOL WINAPI SetWindowTextA ( HWND hWnd, LPCSTR lpString );
    BOOL WINAPI SetWindowTextW ( HWND hWnd, LPCWSTR lpString );

    #ifdef UNICODE
    #define SetWindowText   SetWindowTextW
    #else
    #define SetWindowText   SetWindowTextA
    #endif

    When you build with the MBCS APIs, UNICODE is not defined, so the preprocessor sees:

    #define SetWindowText SetWindowTextA

    This macro turns every call to SetWindowText() into a call to the real API function SetWindowTextA(). (You can of course call SetWindowTextA() or SetWindowTextW() directly, although you rarely need to.)
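The same dispatch pattern can be tried outside Windows with a small sketch. SetTitleA, SetTitleW, SetTitle, TITLE_TEXT, and kUnicodeBuild are hypothetical names invented for this illustration; only the #ifdef UNICODE pattern itself is taken from the header excerpt above.

```cpp
#include <cassert>
#include <string>

// Hypothetical A/W pair; each one reports which version was called.
std::string SetTitleA(const char* text)    { assert(text != nullptr); return "A"; }
std::string SetTitleW(const wchar_t* text) { assert(text != nullptr); return "W"; }

// The generic name is only a macro, selected at compile time.
#ifdef UNICODE
#define SetTitle   SetTitleW
#define TITLE_TEXT L"we love Bob!"
const bool kUnicodeBuild = true;
#else
#define SetTitle   SetTitleA
#define TITLE_TEXT "we love Bob!"
const bool kUnicodeBuild = false;
#endif

// Code written against the generic name compiles to one version or the other.
std::string callGenericName() {
    return SetTitle(TITLE_TEXT);
}
```

Compiled without UNICODE defined, callGenericName() reaches SetTitleA; compiled with -DUNICODE (or /DUNICODE), the same source reaches SetTitleW.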
    So, to switch to the Unicode APIs by default, remove _MBCS from the list of predefined macros in the preprocessor settings and add UNICODE and _UNICODE. (You need to define both, because different headers use different macros.) However, if you have been declaring your strings as char, you will run into trouble. Consider the following code:

    HWND hwnd = GetSomeWindowHandle();
    char szNewText[] = "we love Bob!";
    SetWindowText ( hwnd, szNewText );

    After the preprocessor replaces SetWindowText with SetWindowTextW, the code becomes:

    HWND hwnd = GetSomeWindowHandle();
    char szNewText[] = "we love Bob!";
    SetWindowTextW ( hwnd, szNewText );

    See the problem? We are passing a single-byte string to a function that expects a Unicode string. The first solution is to wrap the string definition in #ifdef:

    HWND hwnd = GetSomeWindowHandle();
    #ifdef UNICODE
    wchar_t szNewText[] = L"we love Bob!";
    #else
    char szNewText[] = "we love Bob!";
    #endif
    SetWindowText ( hwnd, szNewText );

    You can imagine the headache of doing this for every string in your code. The better solution is to use TCHAR.

    Using TCHAR

    TCHAR is a character type that lets you use the same code for both MBCS and Unicode builds, without wrapping your code in tedious macro definitions. TCHAR is defined as follows:

    #ifdef UNICODE
    typedef wchar_t TCHAR;
    #else
    typedef char TCHAR;
    #endif

    So in an MBCS build TCHAR is char, and in a Unicode build it is wchar_t. There is also a macro that takes care of the L prefix required for Unicode string literals:

    #ifdef UNICODE
    #define _T(x) L##x
    #else
    #define _T(x) x
    #endif

    ## is the preprocessor's token-pasting operator, which joins its two operands into a single token. Wherever your code needs a string literal, wrap it in the _T macro. In a Unicode build, the macro puts the L prefix in front of the literal.

    TCHAR szNewText[] = _T("we love Bob!");

    Just as macros hide the details of SetWindowTextA/W, there are macros that stand in for the str***() and _mbs***() string functions. For example, you can use the _tcsrchr macro in place of strrchr(), _mbsrchr(), or wcsrchr(). _tcsrchr expands to the right function depending on whether _MBCS or _UNICODE is defined, just as SetWindowText does.
    It is not only the str***() functions that have TCHAR macros. There are others, such as _stprintf (replacing sprintf() and swprintf()) and _tfopen (replacing fopen() and _wfopen()). MSDN has the complete list under the title "Generic-Text Routine Mappings".
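The generic-text idea can be imitated in portable C++ to see how the pieces fit together. MYTCHAR, MY_T, my_tcslen, my_tcsrchr, and lastIndexOf are hypothetical names that mirror the roles of TCHAR, _T, _tcslen, and _tcsrchr; they are not the definitions from tchar.h, and this sketch covers only the char and wchar_t cases, not the _MBCS mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

#ifdef UNICODE
typedef wchar_t MYTCHAR;
#define MY_T(x)     L##x          // pastes the L prefix onto the literal
#define my_tcslen   std::wcslen
#define my_tcsrchr  std::wcsrchr
#else
typedef char MYTCHAR;
#define MY_T(x)     x             // leaves the literal unchanged
#define my_tcslen   std::strlen
#define my_tcsrchr  std::strrchr
#endif

// Written once against the generic names; builds as narrow or wide code.
// Returns the index of the last occurrence of ch, or the length if absent.
std::size_t lastIndexOf(const MYTCHAR* s, MYTCHAR ch) {
    assert(s != nullptr);
    const MYTCHAR* hit = my_tcsrchr(s, ch);
    return hit ? static_cast<std::size_t>(hit - s) : my_tcslen(s);
}
```

A call such as lastIndexOf(MY_T("a/b/c"), MY_T('/')) compiles unchanged in both kinds of build, which is the whole point of the generic-text macros.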

    String and TCHAR typedefs

    Because the Win32 API documentation lists functions by their generic names (for example, "SetWindowText"), all strings are described in terms of TCHAR. (The exceptions are the Unicode-only APIs introduced in XP.) Some common typedefs you will see in MSDN are listed below.

    Type    | Meaning in MBCS builds | Meaning in Unicode builds
    WCHAR   | wchar_t | wchar_t
    LPSTR   | zero-terminated string of char (char*) | zero-terminated string of char (char*)
    LPCSTR  | constant zero-terminated string of char (const char*) | constant zero-terminated string of char (const char*)
    LPWSTR  | zero-terminated Unicode string (wchar_t*) | zero-terminated Unicode string (wchar_t*)
    LPCWSTR | constant zero-terminated Unicode string (const wchar_t*) | constant zero-terminated Unicode string (const wchar_t*)
    TCHAR   | char | wchar_t
    LPTSTR  | zero-terminated string of TCHAR (TCHAR*) | zero-terminated string of TCHAR (TCHAR*)
    LPCTSTR | constant zero-terminated string of TCHAR (const TCHAR*) | constant zero-terminated string of TCHAR (const TCHAR*)

    When to use TCHAR and Unicode

    By now you may be asking why you should bother with Unicode when you have gotten by with plain char for years. Unicode will benefit you in three cases:

  1. Your program will run only on Windows NT.
  2. Your program needs to handle file names longer than MAX_PATH.
  3. Your program needs to use the Unicode-only APIs introduced in XP.

    On Windows 9x, most APIs have no Unicode implementation, so if your program runs on Windows 9x you must use the MBCS APIs. However, since NT uses Unicode internally, using the Unicode APIs speeds up your program. Every time you pass a string to an MBCS API, the operating system converts the string to Unicode and then calls the corresponding Unicode API. If a string is returned, the operating system converts it back. Although this conversion is highly optimized, the speed penalty is unavoidable.
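The convert-then-forward behavior described above can be pictured with a small sketch. Everything in it is hypothetical: setTitleWide() plays the role of the Unicode implementation, setTitleNarrow() plays the role of the MBCS entry point, and widenAscii() is a deliberately simplified conversion that accepts only 7-bit ASCII, whereas the real system performs a code-page-aware conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical "real" implementation, which works on wide strings only.
// It returns the number of characters it received.
std::size_t setTitleWide(const std::wstring& text) {
    return text.size();
}

// Simplified conversion: widens each char. Assumes 7-bit ASCII input;
// a real conversion would consult the current code page.
std::wstring widenAscii(const char* text) {
    assert(text != nullptr);
    std::wstring wide;
    for (const char* p = text; *p != '\0'; ++p) {
        assert(static_cast<unsigned char>(*p) < 0x80);
        wide += static_cast<wchar_t>(*p);
    }
    return wide;
}

// Hypothetical narrow entry point: converts first, then forwards the call.
// The extra allocation and copy on every call is the cost the text mentions.
std::size_t setTitleNarrow(const char* text) {
    return setTitleWide(widenAscii(text));
}
```

A caller that already holds a wide string can call setTitleWide() directly and skip the conversion step entirely.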
    NT allows very long file names (longer than the MAX_PATH limit of 260 characters), but only if you use the Unicode APIs. Another advantage of the Unicode APIs is that your program automatically handles whatever language the user types in. A user can enter English, Chinese, or Japanese, and you need no extra code to deal with any of them.
    Finally, with the Windows 9x line fading away, Microsoft appears to be moving away from the MBCS APIs. For example, the SetWindowTheme() API, which takes two string parameters, exists only in a Unicode version. Building with Unicode also simplifies string handling, because you no longer need to convert between MBCS and Unicode.
    Even if you do not build with Unicode now, you should still use TCHAR and the related macros. Doing so not only helps your code handle DBCS properly, but also means that a future Unicode build requires nothing more than a change to the preprocessor settings.
