The Complete Guide to C++ Strings, Part I: Win32 Character Encodings

Last Update:2018-12-04 Source: Internet

Author: User


Author: Michael Dunn. Category: VC++.

Introduction

No doubt you have seen a variety of string types such as TCHAR, std::string, and BSTR, as well as strange macros starting with _tcs, and you may well have stared at the monitor wondering what they all mean. This guide summarizes the purpose of each character type, demonstrates some simple usage, and shows how to convert between string types when necessary.

The first part covers the three types of character encoding. It is important to understand how the encoding schemes work: even if you already know that a string is an array of characters, you should read this part. Once you understand it, the relationships between the various string types become clear. The second part describes the string classes themselves, how to use them, and how to convert between them.

The basics of characters: ASCII, DBCS, and Unicode

All string classes are ultimately based on C-style strings, and a C-style string is an array of characters, so we start with the character types. There are three encoding schemes and three corresponding character types.

The first scheme is the single-byte character set (SBCS). In this encoding, every character is exactly one byte long. ASCII is an SBCS. A single zero byte marks the end of an SBCS string.

The second scheme is the multi-byte character set (MBCS). An MBCS encoding contains some characters that are one byte long and others that are more than one byte long. The MBCS schemes used in Windows contain two character types: single-byte characters and double-byte characters. Because most of the multi-byte characters used in Windows are two bytes long, the term double-byte character set (DBCS) is often used in place of MBCS.

In a DBCS encoding, certain byte values are reserved to indicate that they are part of a double-byte character.
For example, in Shift-JIS (a common Japanese encoding), the values 0x81-0x9F and 0xE0-0xFC mean "this is a double-byte character, and the next byte is part of this character." Such values are called lead bytes, and they are always greater than 0x7F. The byte that follows a lead byte is called the trail byte. In DBCS, the trail byte can be any non-zero value. As with SBCS, the end of a DBCS string is marked by a single zero byte.

The third scheme is Unicode. As used on Windows, Unicode is an encoding in which every character is stored in two-byte units. (Windows uses UTF-16; the characters discussed in this article each occupy exactly one two-byte unit, while characters outside the Basic Multilingual Plane take two units.) Unicode characters are sometimes called wide characters because they are wider than single-byte characters, that is, they use more storage. Note that Unicode is not considered an MBCS: the distinguishing feature of MBCS is that its characters have different lengths. A Unicode string is terminated by a two-byte zero.

Single-byte characters cover the Latin alphabet, accented characters, and the graphic characters defined by the ASCII standard and by DOS. Double-byte characters are used for East Asian and Middle Eastern languages. Unicode is used in COM and inside Windows NT.

You are certainly familiar with single-byte characters: whenever you use char, you are dealing with them. Double-byte characters are also manipulated through the char type (one of the many oddities of double-byte characters that we will see). Unicode characters are represented by wchar_t. Unicode character and string literals are written with the L prefix. For example:

    wchar_t        wch = L'1';      // 2 bytes, 0x0031
    const wchar_t* wsz = L"Hello";  // 12 bytes, 6 wide characters

How characters are stored in memory

A single-byte string is stored one character per byte, in sequence, and ends with a single zero byte.
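The byte layouts described in this section can be reproduced with a few lines of portable C++. The following is only a minimal sketch: the helper names (NarrowBytes, Utf16LeBytes, HexDump, CheckBobLayout) are hypothetical, and it uses char16_t rather than wchar_t because wchar_t is two bytes on Windows but is commonly four bytes on other platforms. The little-endian byte order is written out explicitly, so the result does not depend on the machine running the code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Bytes of a single-byte (SBCS) string, including the one-byte terminator.
std::vector<std::uint8_t> NarrowBytes(const std::string& s) {
    std::vector<std::uint8_t> out(s.begin(), s.end());
    out.push_back(0x00);
    return out;
}

// Bytes of a UTF-16 string in little-endian order (as on x86 Windows),
// including the two-byte terminator.
std::vector<std::uint8_t> Utf16LeBytes(const std::u16string& s) {
    std::vector<std::uint8_t> out;
    for (char16_t ch : s) {
        out.push_back(static_cast<std::uint8_t>(ch & 0xFF));  // low byte first
        out.push_back(static_cast<std::uint8_t>(ch >> 8));    // then high byte
    }
    out.push_back(0x00);
    out.push_back(0x00);
    return out;
}

// Formats bytes as space-separated lowercase hex, e.g. "42 6f 62 00".
std::string HexDump(const std::vector<std::uint8_t>& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string text;
    for (std::uint8_t b : bytes) {
        if (!text.empty()) text += ' ';
        text += digits[b >> 4];
        text += digits[b & 0x0F];
    }
    return text;
}

// Self-check: the same three letters take 4 bytes narrow and 8 bytes wide.
void CheckBobLayout() {
    assert(HexDump(NarrowBytes("Bob")) == "42 6f 62 00");
    assert(HexDump(Utf16LeBytes(u"Bob")) == "42 00 6f 00 62 00 00 00");
}
```

Calling CheckBobLayout() from a test program confirms both layouts; the wide form interleaves a zero byte after every ASCII letter, which matters as soon as such a buffer is handed to a function that expects single-byte text.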
The string "Bob" is stored as follows:

    42 6F 62 00
    B  o  b  EOS

The Unicode version, L"Bob", is stored as:

    42 00  6F 00  62 00  00 00
    B      o      b      EOS

with a two-byte zero as the end mark.

At first glance, a DBCS string looks very similar to an SBCS string, but we will see in a moment that DBCS strings have subtleties that produce unexpected results when you use string functions and traverse the string with pointers. The string "日本語" ("nihongo") is stored in memory as follows (LB and TB stand for lead byte and trail byte):

    93 FA  96 7B  8C EA  00
    LB TB  LB TB  LB TB  EOS
    日     本     語

Note that the value of "ni" (日) must not be interpreted as the WORD value 0xFA93. The two bytes 93 and FA, in that order, encode the character.

Using string handling functions

We have all seen the C string functions such as strcpy(), sprintf(), atol(), and so on. These functions must be used only with single-byte strings. The runtime library also has versions that work only on Unicode strings, such as wcscpy(), swprintf(), and _wtol().

Microsoft also added versions to its CRT (C runtime library) that operate on DBCS strings. The str***() functions have corresponding DBCS versions named _mbs***(). If you expect to encounter DBCS strings (and you will if your software is installed in countries that use DBCS encodings, such as China and Japan), you should use the _mbs***() functions, because they also handle SBCS strings. (A DBCS string may contain only single-byte characters, which is why the _mbs***() functions work on SBCS strings too.)

Let's look at a typical string to see why different versions of the string functions are needed. Take the Unicode string L"Bob" again:

    42 00  6F 00  62 00  00 00
    B      o      b      EOS

Because x86 CPUs are little-endian, the value 0x0042 is stored in memory as 42 00. Can you see what happens if this string is passed to strlen()?
It sees the first byte, 42, then 00, which it takes as the end of the string, so strlen() returns 1. The reverse case, passing "Bob" to wcslen(), is even worse. wcslen() first sees 0x6F42, then 0x0062, and then keeps reading past the end of your buffer until it happens to find a 00 00 sequence or causes a GPF (general protection fault).

So far we have covered str***() versus wcs***(). What about str***() versus _mbs***()? Understanding the difference is important for traversing DBCS strings correctly. We first look at traversing strings and then return to str***() versus _mbs***().

Traversing and indexing strings correctly

Because most of us grew up with SBCS strings, we are used to applying ++ and -- to pointers when traversing a string, and to using array notation to access individual characters. Both methods work for SBCS and Unicode strings, because all their characters have the same width and the compiler can return the character we ask for.

With DBCS strings, we must break these habits. There are two rules for traversing a DBCS string with a pointer. Break them and your program will have DBCS-related bugs.

1. Do not traverse forward with ++ unless you check for lead bytes along the way.
2. Never traverse backward with --.

Let's illustrate rule 2 first, since it is easy to find real code that breaks it. Suppose your program keeps a configuration file in its own directory and stores the install directory in the registry. At run time you read the install directory from the registry, build the configuration file name, and then read the file. If the install directory is C:\Program Files\MyCoolApp, the file name you build should be C:\Program Files\MyCoolApp\config.bin.
When you test the program, it works fine. Now imagine that the code that builds the file name looks like this:

    bool GetConfigFileName ( char* pszName, size_t nBuffSize )
    {
        char szConfigFilename[MAX_PATH];

        // Read install dir from registry... we'll assume it succeeds.

        // Add on a backslash if it wasn't present in the registry value.
        // First, get a pointer to the terminating zero.
        char* pLastChar = strchr ( szConfigFilename, '\0' );

        // Now move it back one character.
        pLastChar--;

        if ( *pLastChar != '\\' )
            strcat ( szConfigFilename, "\\" );

        // Add on the name of the config file.
        strcat ( szConfigFilename, "config.bin" );

        // If the caller's buffer is big enough, return the filename.
        if ( strlen ( szConfigFilename ) >= nBuffSize )
            return false;
        else
        {
            strcpy ( pszName, szConfigFilename );
            return true;
        }
    }

This code looks quite defensive, yet it breaks when it meets certain DBCS characters. Let's see why. Suppose a Japanese user installs your program in C:\ヨウコソ. The name is stored in memory as follows:

    43 3A 5C  83 88  83 45  83 52  83 5C  00
    C  :  \   LB TB  LB TB  LB TB  LB TB  EOS
              ヨ     ウ     コ     ソ

When GetConfigFileName() checks for a trailing backslash, it looks at the last non-zero byte of the install directory, sees that it equals '\\', and so does not append another backslash. The result is that the code returns the wrong file name.

What went wrong? Look at the last two bytes before the terminator, 83 5C. The value of the backslash is 0x5C, and the value of the character ソ is 83 5C. The code above mistakenly read a trail byte and treated it as a character in its own right.

The correct way to traverse backward is to use a function that is aware of DBCS characters and moves the pointer by the correct number of bytes. Here is the corrected code; the pointer movement is now done by CharPrev():

    bool FixedGetConfigFileName ( char* pszName, size_t nBuffSize )
    {
        char szConfigFilename[MAX_PATH];

        // Read install dir from registry... we'll assume it succeeds.

        // Note: the _mbs***() functions take unsigned char*, so real C++
        // code needs casts. They are omitted here for readability.

        // Add on a backslash if it wasn't present in the registry value.
        // First, get a pointer to the terminating zero.
        char* pLastChar = _mbschr ( szConfigFilename, '\0' );

        // Now move it back one double-byte character.
        pLastChar = CharPrev ( szConfigFilename, pLastChar );

        if ( *pLastChar != '\\' )
            _mbscat ( szConfigFilename, "\\" );

        // Add on the name of the config file.
        _mbscat ( szConfigFilename, "config.bin" );

        // If the caller's buffer is big enough, return the filename.
        // Buffer sizes are in bytes, so compare against the byte count;
        // _mbslen() would count characters instead.
        if ( strlen ( szConfigFilename ) >= nBuffSize )
            return false;
        else
        {
            _mbscpy ( pszName, szConfigFilename );
            return true;
        }
    }

This version uses the CharPrev() API to move pLastChar back one character, which may be two bytes long. The if condition now works correctly, because a lead byte can never equal 0x5C.

You can probably imagine a way to break rule 1. For example, you might validate a file name entered by a user by checking whether ':' appears more than once. If you traverse the string with ++ instead of CharNext(), you may raise a false error whenever a trail byte happens to equal the value of ':'.

Related to rule 2 is this rule about string indexes:

2a. Never calculate an index into a string using subtraction.

Code that breaks this rule looks very similar to code that breaks rule 2. For example:

    char* pLastChar = &szConfigFilename[strlen(szConfigFilename) - 1];

This has exactly the same effect as moving a pointer backward.

Back to str***() versus _mbs***()

It should now be clear why the _mbs***() functions are necessary. The str***() functions know nothing about DBCS characters, while the _mbs***() functions do. If you call strrchr() on the install directory above to find the last '\\', the result may be wrong, whereas _mbsrchr() recognizes the double-byte character at the end and returns a pointer to the last real backslash.

One last point about the string functions: the str***() functions take and return lengths measured in chars (that is, bytes), not in characters. Among the _mbs***() functions, the _mbsnb***() variants also count bytes, while _mbslen() and the _mbsn***() variants count characters.
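To make the traversal rules concrete without depending on the Win32 headers, here is a minimal portable sketch. It is not the Win32 implementation: SjisNext() and SjisPrev() are hypothetical stand-ins for CharNext() and CharPrev(), IsSjisLeadByte() hard-codes the Shift-JIS lead-byte ranges quoted earlier instead of consulting the system code page, and kInstallDir is simply the install path from the example written out as bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "C:\ヨウコソ" in Shift-JIS: 43 3A 5C 83 88 83 45 83 52 83 5C.
const char kInstallDir[] = "C:\\\x83\x88\x83\x45\x83\x52\x83\x5C";

// Shift-JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
bool IsSjisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Rule 1: step forward one character, which is two bytes after a lead byte.
const char* SjisNext(const char* p) {
    if (*p == '\0') return p;  // never step past the terminator
    const unsigned char b = static_cast<unsigned char>(*p);
    if (IsSjisLeadByte(b) && p[1] != '\0') return p + 2;
    return p + 1;
}

// Rule 2: to step back, rescan from the start of the string, because
// whether a byte is a trail byte depends on the bytes before it.
const char* SjisPrev(const char* start, const char* current) {
    assert(start <= current);
    const char* prev = start;
    const char* p = start;
    while (p < current && *p != '\0') {
        prev = p;
        p = SjisNext(p);
    }
    return prev;
}

// Number of characters (not bytes) in a Shift-JIS string.
std::size_t SjisCharCount(const char* s) {
    std::size_t count = 0;
    for (const char* p = s; *p != '\0'; p = SjisNext(p)) ++count;
    return count;
}

// DBCS-aware replacement for the "last byte is a backslash" check.
bool EndsWithBackslash(const char* path) {
    const char* end = path + std::strlen(path);
    if (end == path) return false;
    return *SjisPrev(path, end) == '\\';
}
```

With these helpers, the naive test kInstallDir[strlen(kInstallDir) - 1] == '\\' is true, because it lands on the trail byte 5C, while EndsWithBackslash(kInstallDir) is false. The rescan in SjisPrev() costs time proportional to the string length, which is the price of not being able to classify a byte in isolation.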
Therefore, for a string that contains three double-byte characters, strlen() returns 6, while _mbslen(), which counts characters rather than bytes, returns 3. The Unicode functions return lengths in wchar_t units. For example, wcslen(L"Bob") returns 3.

MBCS and Unicode in the Win32 API

Although you may never have noticed, every API and message in Win32 that deals with strings has two versions. One version accepts MBCS strings, and the other accepts Unicode strings. For example, there is no API called SetWindowText() at all. Instead, there are SetWindowTextA() and SetWindowTextW(). The A suffix marks the MBCS function, and the W suffix marks the Unicode function.

When you build a Windows program, you can choose the MBCS or the Unicode APIs. If you used the VC wizards and never changed the preprocessor settings, you have been using the MBCS versions. So how can we write SetWindowText() if there is no such API? The winuser.h header contains some macros, such as:

    BOOL WINAPI SetWindowTextA ( HWND hWnd, LPCSTR lpString );
    BOOL WINAPI SetWindowTextW ( HWND hWnd, LPCWSTR lpString );

    #ifdef UNICODE
    #define SetWindowText  SetWindowTextW
    #else
    #define SetWindowText  SetWindowTextA
    #endif

When you build with the MBCS APIs, UNICODE is not defined, so the preprocessor sees:

    #define SetWindowText  SetWindowTextA

This macro turns every call to SetWindowText into a call to the real API function SetWindowTextA. (You can of course call SetWindowTextA() or SetWindowTextW() directly, although you rarely need to.)

So, to switch to the Unicode APIs by default, go to the preprocessor settings, remove _MBCS from the list of predefined symbols, and add UNICODE and _UNICODE. (You need to define both, because different header files use different symbols.) However, you will run into a snag if you have been using plain char for your strings. Consider this code:

    HWND hwnd = GetSomeWindowHandle();
    char szNewText[] = "we love Bob!";

    SetWindowText ( hwnd, szNewText );

After the preprocessor replaces SetWindowText with SetWindowTextW, the code becomes:

    HWND hwnd = GetSomeWindowHandle();
    char szNewText[] = "we love Bob!";

    SetWindowTextW ( hwnd, szNewText );

See the problem? We are passing a single-byte string to a function that takes a Unicode string. The first solution is to put #ifdef around the definition of the string variable:

    HWND hwnd = GetSomeWindowHandle();
    #ifdef UNICODE
    wchar_t szNewText[] = L"we love Bob!";
    #else
    char szNewText[] = "we love Bob!";
    #endif

    SetWindowText ( hwnd, szNewText );

You can probably imagine the headache of doing this for every string. The better solution is TCHAR.

TCHAR is a character type that lets you use the same code for both MBCS and Unicode builds, without wrapping your code in tedious macro definitions. TCHAR is defined as follows:

    #ifdef UNICODE
    typedef wchar_t TCHAR;
    #else
    typedef char TCHAR;
    #endif

So TCHAR is char in MBCS builds and wchar_t in Unicode builds. There is also a macro, _T(), that deals with the L prefix needed for Unicode string literals:

    #ifdef UNICODE
    #define _T(x) L##x
    #else
    #define _T(x) x
    #endif

## is the preprocessor's token-pasting operator, which joins two tokens into one. Whenever your code needs a string literal, wrap it in the _T macro. In a Unicode build, the macro adds the L prefix to the literal.

    TCHAR szNewText[] = _T("we love Bob!");

Just as macros hide the details of SetWindowTextA/W, there are macros you can use in place of the str***() and _mbs***() string functions. For example, you can use the _tcsrchr macro in place of strrchr(), _mbsrchr(), and wcsrchr(). _tcsrchr expands to the right function depending on whether _MBCS or _UNICODE is defined, just as SetWindowText does.

The str***() functions are not the only ones with TCHAR macros.
Other functions have them too, for example _stprintf (in place of sprintf() and swprintf()) and _tfopen (in place of fopen() and _wfopen()). The full list of macros is in MSDN under the title "Generic-Text Routine Mappings."

String and TCHAR typedefs

Because the Win32 API documentation lists functions by their generic names (for example, "SetWindowText"), all strings are given in terms of TCHAR. (The exception is the Unicode-only APIs introduced in XP.) Here are the common typedefs you will see in MSDN:

    Type      Meaning in MBCS builds                                    Meaning in Unicode builds
    WCHAR     wchar_t                                                   wchar_t
    LPSTR     zero-terminated string of char (char*)                    zero-terminated string of char (char*)
    LPCSTR    constant zero-terminated string of char (const char*)     constant zero-terminated string of char (const char*)
    LPWSTR    zero-terminated Unicode string (wchar_t*)                 zero-terminated Unicode string (wchar_t*)
    LPCWSTR   constant zero-terminated Unicode string (const wchar_t*)  constant zero-terminated Unicode string (const wchar_t*)
    TCHAR     char                                                      wchar_t
    LPTSTR    zero-terminated string of TCHAR (TCHAR*)                  zero-terminated string of TCHAR (TCHAR*)
    LPCTSTR   constant zero-terminated string of TCHAR (const TCHAR*)   constant zero-terminated string of TCHAR (const TCHAR*)

When to use TCHAR and Unicode

After all this, you may be wondering why you should use Unicode at all when you have gotten by with plain char for years. Building with Unicode benefits you in three cases:

1. Your program runs only on Windows NT.
2. Your program needs to handle file names longer than MAX_PATH characters.
3. Your program uses Unicode-only APIs introduced in Windows XP.

Most of the Unicode APIs are not implemented on Windows 9x, so if your program must run on Windows 9x, you have to use the MBCS APIs. However, because NT uses Unicode internally, using the Unicode APIs there speeds up your program. Each time you pass a string to an MBCS API, the operating system converts the string to Unicode and then calls the corresponding Unicode API. If a string is returned, the operating system converts it back.
Although this conversion is highly optimized, it still costs some speed.

NT allows very long file names (longer than the usual MAX_PATH limit of 260 characters), but only if you use the Unicode APIs. Another advantage of the Unicode APIs is that your program automatically handles whatever language the user types in. A user can enter English, Chinese, or Japanese, and you do not need to write extra code to deal with them.

Finally, as the Windows 9x line fades away, Microsoft seems to be moving away from the MBCS APIs. For example, the SetWindowTheme() API, which takes two string parameters, exists only in a Unicode version. Building your program with Unicode simplifies string handling, because you no longer have to convert between MBCS and Unicode.

Even if you do not build with Unicode now, you should still use TCHAR and its related macros. Not only does this go a long way toward making your code handle DBCS correctly, but if you later want a Unicode build, all you need to change is a preprocessor setting.
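The TCHAR mechanism can be tried out in isolation. The sketch below is a simplified imitation, not the contents of the real headers: DEMO_UNICODE, demo_tchar, DEMO_T, demo_tcslen, demo_tcsrchr, and FileNamePart are all hypothetical names chosen so that they cannot clash with the real TCHAR definitions. One deliberate simplification is that the narrow branch maps to strrchr(), which is not DBCS-aware; according to the text above, the real _tcsrchr would map to _mbsrchr() in an _MBCS build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// DEMO_UNICODE is a hypothetical switch standing in for UNICODE/_UNICODE.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(x) L##x              // paste the L prefix onto the literal
#define demo_tcslen  std::wcslen
#define demo_tcsrchr std::wcsrchr
#else
typedef char demo_tchar;
#define DEMO_T(x) x                 // leave the literal unchanged
#define demo_tcslen  std::strlen
#define demo_tcsrchr std::strrchr   // simplification: not DBCS-aware
#endif

// Returns the part of a path after the last backslash. The function is
// written once and compiles for either character type.
const demo_tchar* FileNamePart(const demo_tchar* path) {
    assert(path != nullptr);
    const demo_tchar* slash = demo_tcsrchr(path, DEMO_T('\\'));
    return slash ? slash + 1 : path;
}
```

Compiling the same file with and without -DDEMO_UNICODE (or the equivalent compiler option) switches every literal and every function call together, which is the whole point of the generic-text approach: demo_tcslen(DEMO_T("we love Bob!")) yields 12 in both builds, counted in char units in one and in wchar_t units in the other.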

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
