Win32 character encoding

Source: Internet
Author: User

All string classes are built on C-style strings, and a C-style string is an array of characters that ends with a zero terminator.

Three character encoding types: ASCII, DBCS, and Unicode

Three encoding modes correspond to three character types

The first encoding type is the single-byte character set (SBCS). In this encoding, every character is represented by exactly one byte. ASCII is an SBCS. A single zero byte marks the end of an SBCS string.
The second encoding type is the multi-byte character set (MBCS). In an MBCS encoding, some characters are one byte long and others are longer than one byte. In Windows, MBCS encodings contain two kinds of characters: single-byte characters and double-byte characters. Because most multi-byte characters used in Windows are two bytes long, the term DBCS (double-byte character set) is often used in place of MBCS.

In a DBCS encoding, certain values are reserved to indicate that they are part of a double-byte character. For example, in Shift-JIS (a common Japanese encoding), values in the ranges 0x81-0x9F and 0xE0-0xFC mean "this is a double-byte character, and the next byte is part of it." Such values are called "lead bytes", and they are always greater than 0x7F. The byte that follows a lead byte is called the "trail byte". In DBCS, the trail byte can be any non-zero value. As with SBCS, the end of a DBCS string is marked by a single zero byte.
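The lead-byte rule above can be sketched in portable C. This is only an illustration: the helper names `sjis_is_lead_byte` and `sjis_char_count` are made up for this example, and the ranges are the Shift-JIS ranges quoted in the text.

```c
#include <stddef.h>

/* Returns 1 if b falls in the Shift-JIS lead-byte ranges quoted above
   (0x81-0x9F or 0xE0-0xFC), otherwise 0. Hypothetical helper. */
static int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts characters (not bytes) in a zero-terminated Shift-JIS string. */
static size_t sjis_char_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != 0) {
        /* A lead byte also consumes its trail byte, unless the string is
           truncated and the next byte is already the terminator. */
        if (sjis_is_lead_byte(*p) && p[1] != 0)
            p += 2;
        else
            p += 1;
        count++;
    }
    return count;
}
```

For a five-byte string made of one ASCII letter followed by two double-byte characters, `sjis_char_count` reports three characters, whereas `strlen` would report five bytes.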
The third encoding type is Unicode. As used in Windows, Unicode stores text in two-byte units (UTF-16), and nearly all commonly used characters occupy exactly one unit. Unicode characters are sometimes called wide characters because they are wider (use more storage) than single-byte characters. Note that Unicode is not considered an MBCS: the distinguishing feature of an MBCS is that its characters are encoded with different byte lengths. A Unicode string is terminated by a zero stored in two bytes.
Single-byte characters cover the Latin alphabet, accented characters, and the graphic characters defined by the ASCII standard and the DOS operating system. Double-byte characters are used for East Asian and Middle Eastern languages. Unicode is used by COM and internally by Windows NT.

You are surely familiar with single-byte characters: whenever you use char, you are working with them. Double-byte characters are also manipulated through the char type (one of the many oddities of double-byte characters that we will see). Unicode characters are represented by wchar_t. Unicode character and string constants are written with the prefix L. For example:

wchar_t wch = L'1';       // 2 bytes, 0x0031
wchar_t* wsz = L"Hello";  // 12 bytes, 6 wide characters
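The byte counts in those comments assume a 2-byte wchar_t, which is what Windows compilers use. A small portable sketch (the helper name `wide_string_bytes` is hypothetical) computes the size instead of hard-coding it:

```c
#include <stddef.h>
#include <wchar.h>

/* Bytes occupied by a wide string, including its terminating L'\0'. */
static size_t wide_string_bytes(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

With a 2-byte wchar_t, `wide_string_bytes(L"Hello")` gives the 12 bytes mentioned above. On platforms where wchar_t is 4 bytes wide, the same call gives 24.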
Use string processing functions

We have all seen the C string functions such as strcpy(), sprintf(), and atol(). These functions should be used only with single-byte strings. The standard library also provides versions that work only on Unicode strings, such as wcscpy() and swprintf(), and Microsoft's CRT adds _wtol(). Microsoft also added versions to its CRT (C runtime library) that operate on DBCS strings: each strxxx() function has a corresponding DBCS version named _mbsxxx(). If you expect to encounter DBCS strings (and you may if your software is installed in countries that use DBCS encodings, such as China and Japan), you should use the _mbsxxx() functions, because they can also process SBCS strings. (A DBCS string may contain only single-byte characters, which is why the _mbsxxx() functions also work on SBCS strings.)
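As a rough illustration of the wide-character family, the sketch below uses only standard C functions. The helper name `parse_and_label` is invented for this example, and wcstol() is used as the standard wide counterpart for number parsing.

```c
#include <stddef.h>
#include <wchar.h>

/* Parses a decimal number from wide text and writes "value=<number>"
   into out. Returns the parsed value. Hypothetical helper. */
static long parse_and_label(const wchar_t *digits, wchar_t *out, size_t out_len)
{
    wchar_t label[16];
    long value;

    wcscpy(label, L"value");            /* wide counterpart of strcpy  */
    value = wcstol(digits, NULL, 10);   /* wide counterpart of strtol  */
    swprintf(out, out_len, L"%ls=%ld", label, value);  /* wide sprintf */
    return value;
}
```

Note that the standard swprintf() takes the buffer size as its second argument, unlike sprintf().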
Correct traversal and indexing of strings

Most of us grew up using SBCS strings, so we are used to applying the ++ and -- operators to pointers when traversing a string. We also use array notation to access individual characters. Both methods work for SBCS and Unicode strings because every character has the same width, so the compiler can return exactly the character we want. With DBCS strings, however, we must drop these habits. There are two rules for traversing a DBCS string with a pointer. If you break them, your program will have DBCS-related bugs.

  • 1. Do not traverse forward with ++ unless you check for a lead byte each time;
  • 2. Never traverse backward with --.
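A minimal sketch of rule 1, assuming Shift-JIS lead-byte ranges. `dbcs_next` is a hypothetical stand-in for what the CRT's _mbsinc() or the Win32 CharNext() function does, and `dbcs_find_char` shows why a plain byte search can be fooled: a trail byte may have the same value as an ASCII character such as the backslash (0x5C).

```c
#include <stddef.h>

/* Shift-JIS lead-byte test (ranges 0x81-0x9F and 0xE0-0xFC). */
static int is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Advances by one character rather than one byte. Hypothetical helper. */
static const char *dbcs_next(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;

    if (u[0] == 0)
        return p;                  /* already at the terminator */
    if (is_lead(u[0]) && u[1] != 0)
        return p + 2;              /* lead byte plus trail byte */
    return p + 1;                  /* single-byte character     */
}

/* Finds the first single-byte character ch, never matching a trail byte. */
static const char *dbcs_find_char(const char *s, char ch)
{
    const char *p;

    for (p = s; *p != '\0'; p = dbcs_next(p)) {
        if (!is_lead((unsigned char)*p) && *p == ch)
            return p;
    }
    return NULL;
}
```

In the byte sequence 0x95 0x5C 0x5C 0x61, the first 0x5C is a trail byte and only the second is a real backslash. A byte-wise search stops at offset 1, while `dbcs_find_char` returns offset 2. Backward traversal is harder because, looking at a single byte, there is no way to tell whether it is a trail byte without scanning from a known character boundary.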
    MBCS and Unicode in Win32 APIs

    Two sets of APIs: although you may never have noticed, every string-related API and message in Win32 has two versions. One version accepts MBCS strings and the other accepts Unicode strings. For example, there is actually no SetWindowText() API. Instead, there are SetWindowTextA() and SetWindowTextW(). The A suffix denotes the MBCS function, and the W suffix denotes the Unicode function. When you build a Windows program, you choose either the MBCS or the Unicode APIs. If you used the VC wizard and did not change the preprocessor settings, you are using the MBCS versions. So if there is no SetWindowText() API, why can we call it? The winuser.h header file contains macros such as:
    BOOL WINAPI SetWindowTextA ( HWND hWnd, LPCSTR lpString );
    BOOL WINAPI SetWindowTextW ( HWND hWnd, LPCWSTR lpString );

    #ifdef UNICODE
    #define SetWindowText  SetWindowTextW
    #else
    #define SetWindowText  SetWindowTextA
    #endif
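The same selection pattern can be tried outside Windows with a pair of made-up functions. `TitleLengthA`, `TitleLengthW`, and the `TitleLength` macro are hypothetical names used only to mirror the A/W convention. They are not part of any Windows header.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical A/W pair standing in for an API such as SetWindowTextA/W. */
static size_t TitleLengthA(const char *text)    { return strlen(text); }
static size_t TitleLengthW(const wchar_t *text) { return wcslen(text); }

/* The unsuffixed name is only a macro that picks one of the two. */
#ifdef UNICODE
#define TitleLength  TitleLengthW
#else
#define TitleLength  TitleLengthA
#endif
```

Compiled without UNICODE defined, a call to `TitleLength("Hello")` resolves to `TitleLengthA`. Compiled with `-DUNICODE`, the same name resolves to `TitleLengthW` and expects a wide string.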

     

    Using TCHAR

    TCHAR is a character type that lets you use the same code for both MBCS and Unicode builds, without cluttering your code with tedious macro definitions. TCHAR is defined as follows:

    #ifdef UNICODE
    typedef wchar_t TCHAR;
    #else
    typedef char TCHAR;
    #endif

    So TCHAR is char in an MBCS build and wchar_t in a Unicode build. There is also a macro that handles the L prefix required for Unicode string constants.

    #ifdef UNICODE
    #define _T(x) L##x
    #else
    #define _T(x) x
    #endif

    ## is a preprocessing operator that pastes two tokens together. Whenever your code needs a string constant, wrap it in the _T macro. In a Unicode build, the macro adds the L prefix to the string constant.
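To see the pieces working together without the Windows headers, here is a portable sketch. `MY_TCHAR`, `MY_T`, and `my_tcslen` are stand-ins invented for this example that mirror TCHAR, _T, and the generic-text routine _tcslen from tchar.h.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-ins for TCHAR, _T and _tcslen; the names are hypothetical. */
#ifdef UNICODE
typedef wchar_t MY_TCHAR;
#define MY_T(x)    L##x
#define my_tcslen  wcslen
#else
typedef char MY_TCHAR;
#define MY_T(x)    x
#define my_tcslen  strlen
#endif

/* The body is identical in MBCS and Unicode builds. */
static size_t greeting_length(void)
{
    const MY_TCHAR *greeting = MY_T("Hello");
    return my_tcslen(greeting);
}
```

Only the build setting changes between the two variants. The source line that declares and measures the string stays the same.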

    When to use TCHAR and Unicode

    Now you may ask why you should use Unicode at all when you have been using char for years. Unicode will benefit you in the following three cases:

  • 1. Your program runs only on Windows NT.
  • 2. Your program needs to handle file names longer than MAX_PATH.
  • 3. Your program needs to use Unicode-only APIs introduced in XP.

    In Windows 9x, most APIs do not have a Unicode implementation. Therefore, if your program runs on Windows 9x, you must use the MBCS APIs. However, because NT-based systems use Unicode internally, using the Unicode APIs will speed up your program. Each time you pass a string to an MBCS API, the operating system converts the string to Unicode and then calls the corresponding Unicode API. If a string is returned, the operating system converts it back. Although the conversion is highly optimized, the speed penalty is unavoidable.
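The convert-then-forward behavior can be sketched in portable C. This is an assumption-laden illustration, not how Windows is implemented: `CountCharsA` and `CountCharsW` are hypothetical functions, and the standard mbstowcs() stands in for the conversion that Windows performs (its own conversion routine is MultiByteToWideChar).

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

#define MAX_TEXT 260

/* Hypothetical "W" function: does the real work on wide text. */
static long CountCharsW(const wchar_t *text)
{
    return (long)wcslen(text);
}

/* Hypothetical "A" function: converts its argument to wide text and
   forwards to the W version. Returns -1 if the conversion fails. */
static long CountCharsA(const char *text)
{
    wchar_t wide[MAX_TEXT];
    size_t n = mbstowcs(wide, text, MAX_TEXT - 1);

    if (n == (size_t)-1)
        return -1;            /* invalid multibyte sequence */
    wide[n] = L'\0';          /* mbstowcs may not terminate a full buffer */
    return CountCharsW(wide);
}
```

Every call through the A function pays for an extra buffer and a conversion pass, which is the cost described above.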
    As long as you use the Unicode APIs, NT-based systems allow very long file names (exceeding the MAX_PATH limit, where MAX_PATH = 260). Another advantage of the Unicode APIs is that your program automatically handles whatever language the user types. A user can enter English, Chinese, or Japanese, and you do not need to write extra code to process the input.
    Finally, as Windows 9x fades away, Microsoft seems to be abandoning the MBCS APIs. For example, the SetWindowTheme() API, which takes two string parameters, has only a Unicode version. Building your program with Unicode simplifies string handling, because you do not have to convert between MBCS and Unicode.
    Even if you do not build your program with Unicode today, you should still use TCHAR and its related macros. Not only will your code handle DBCS properly, but if you later want a Unicode build, you only need to change a preprocessor setting.

     

     
