Unicode C/C++ Programming

Source: Internet
Author: User

The biggest advantage of Unicode is that there is only one character set. A program that uses Unicode character encoding can be compiled in any country's development environment, and its text displays correctly in an editing environment for any language instead of appearing as garbled characters. Does Unicode have any disadvantages? Of course. With a 16-bit encoding, Unicode strings occupy twice as much memory as ASCII strings. However, compressing files can greatly reduce the disk space they occupy.

For C programming, you can use the wide-character data type to support Unicode when processing character data, and so write programs that work in multiple languages.

The char Data Type

 

We are all presumably very familiar with using the char data type in C programs to define and store characters and strings. However, to make it easier to understand how C handles wide characters, let's first review the standard character definitions that may appear in Win32 programs.

The following statement defines and initializes a variable that only contains one character:

char c = 'A' ; 

The variable c requires 1 byte of storage and is initialized with the hexadecimal value 0x41, which is the ASCII code for the letter A.

You can define a pointer to a string like this:

char * p ; 

Because Windows is a 32-bit operating system, the pointer variable p requires 4 bytes of storage. You can also initialize a pointer to a string:

char * p = "Hello!" ; 

As before, the variable p requires 4 bytes of storage. The string itself is stored in static memory and occupies 7 bytes: 6 bytes for the characters of the string and 1 byte for the terminating 0.

You can also define character arrays like this:

char a[10] ; 

In this case, the compiler reserves 10 bytes of storage for the array. The expression sizeof (a) returns 10. If the array is a global variable (that is, defined outside all functions), you can initialize the character array with a statement like the following:

char a[] = "Hello!" ; 

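If you define this array as a local variable of a function, define it as a static variable, as shown below. (Pre-ANSI compilers required this; standard C also allows a non-static local array to be initialized, in which case the array is filled in each time the function is entered.)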

static char a[] = "Hello!" ; 

In either case, the string is stored in static program memory with a 0 appended at the end, so it requires 7 bytes of storage.
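The storage figures above can be checked with sizeof and strlen. The following is a minimal sketch; the names hello_array, hello_pointer, and check_char_storage are made up for illustration, and the pointer size is compared against sizeof (char *) rather than 4 because it depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Array form: the compiler sizes the array to hold the text plus the 0. */
static char hello_array[] = "Hello!";

/* Pointer form: the variable holds only the address of the string. */
static const char *hello_pointer = "Hello!";

/* Storage used by the array: 6 characters plus the terminating 0. */
size_t hello_array_bytes(void)
{
    return sizeof hello_array;
}

/* Storage used by the pointer variable itself (platform dependent). */
size_t hello_pointer_bytes(void)
{
    return sizeof hello_pointer;
}

/* Hypothetical self-check of the figures quoted in the text. */
void check_char_storage(void)
{
    assert(hello_array_bytes() == 7);
    assert(strlen(hello_array) == 6);
    assert(hello_array[6] == '\0');
    assert(hello_pointer_bytes() == sizeof(char *));
}
```

Note that sizeof applied to the pointer reports the size of the pointer, not of the string it points to.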

 

Wide Characters

Unicode, or wide characters, does not change the meaning of the char data type in C. A char continues to indicate 1 byte of storage, and sizeof (char) continues to return 1. In theory, a byte in C can be wider than 8 bits, but for most of us a byte (and hence a char) is 8 bits.

Wide characters in C are based on the wchar_t data type. It is defined in several header files, including wchar.h, like this:

typedef unsigned short wchar_t ; 

Therefore, in this environment the wchar_t data type is the same as an unsigned short integer: 16 bits wide. (Other compilers may define wchar_t differently, for example as a 32-bit type.)

To define a variable that contains a wide character, use the following statement:

wchar_t c = 'A' ; 

The variable c is the two-byte value 0x0041, which is the Unicode representation of the letter A. (However, because Intel microprocessors store multi-byte values with the least significant byte first, the bytes are actually stored in memory in the order 0x41, 0x00. Keep this in mind if you examine how Unicode text is stored in memory.)
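To see the byte order for yourself, you can copy a wide character into a byte array and look at the byte at the lowest address. This is only a sketch: the helper names are hypothetical, and the check is written so that it holds whatever the size of wchar_t and the byte order of the machine.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Returns the byte stored at the lowest address of a wide character. */
unsigned char first_stored_byte(wchar_t wc)
{
    unsigned char bytes[sizeof(wchar_t)];

    memcpy(bytes, &wc, sizeof wc);
    return bytes[0];
}

/* Returns 1 on a little-endian machine (such as Intel), 0 otherwise. */
int is_little_endian(void)
{
    unsigned int probe = 1;
    unsigned char low;

    memcpy(&low, &probe, 1);
    return low == 1;
}

/* Hypothetical self-check: L'A' has the value 0x41, and on a
   little-endian machine that 0x41 byte comes first in memory. */
void check_wide_a_layout(void)
{
    wchar_t c = L'A';

    assert(c == 0x41);
    assert(first_stored_byte(c) == (is_little_endian() ? 0x41 : 0x00));
}
```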

You can also define a pointer to a wide string:

wchar_t * p = L"Hello!" ; 

Note the uppercase letter L (for "long") immediately before the first quotation mark. This tells the compiler to store the string with wide characters, that is, with each character occupying 2 bytes. As usual, the pointer variable p requires 4 bytes of storage, while the string itself requires 14 bytes: 2 bytes for each character, plus 2 bytes for the terminating 0.

Similarly, you can use the following statement to define a wide character array:

static wchar_t a[] = L"Hello!" ; 

This string also requires 14 bytes of storage, and sizeof (a) returns 14. You can index the array a to obtain the individual characters. The value of a[1] is the wide character 'e', or 0x0065.
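A small sketch of the same array follows. Because wchar_t is 2 bytes on the compiler described here but may be larger elsewhere, the byte count is expressed as 7 * sizeof (wchar_t), which is 14 when wchar_t is 16 bits wide. The COUNT_OF macro and the function names are illustrative assumptions, not part of any standard header.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

static wchar_t wide_hello[] = L"Hello!";

/* Number of elements in an array, as opposed to its size in bytes. */
#define COUNT_OF(array) (sizeof (array) / sizeof (array)[0])

size_t wide_hello_bytes(void)
{
    return sizeof wide_hello;
}

size_t wide_hello_elements(void)
{
    return COUNT_OF(wide_hello);
}

/* Hypothetical self-check of the figures quoted in the text. */
void check_wide_array(void)
{
    assert(wide_hello_elements() == 7);            /* 6 characters + 0 */
    assert(wide_hello_bytes() == 7 * sizeof(wchar_t));
    assert(wide_hello[1] == 0x0065);               /* the letter e */
}
```

Dividing sizeof by the element size is the usual way to get a character count that stays correct when the character type changes.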

Although it looks more like a typo than anything else, the L before the first quotation mark is very important, and there must be no space between the two symbols. Only with the L does the compiler know that you want the string stored with 2 bytes per character. Later, when we meet wide-character strings in places other than variable definitions, you will again see the L before the first quotation mark. Fortunately, if you forget to include the L, the C compiler usually issues a warning or error message.

You can also use the L prefix before a single character to indicate that they should be interpreted as wide characters. As follows:

wchar_t c = L'A' ; 

But this is usually unnecessary. The C compiler will widen the character value to make it a wide character.

Wide-Character Library Functions

We all know how to get the length of a string. For example, if we have defined a string pointer as follows:

char * pc = "Hello!" ; 

We can call

iLength = strlen (pc) ; 

The variable iLength is then equal to 6, the number of characters in the string.

Great! Now let's try defining a pointer to a wide-character string:

wchar_t * pw = L"Hello!" ; 

Call strlen again:

iLength = strlen (pw) ; 

Now we are in trouble. First, the C compiler will display a warning message, which may be:

'function' : incompatible types - from 'unsigned short *' to 'const char *'

This message says that the strlen function is declared to accept a pointer to char, but it is receiving a pointer to unsigned short. You can still compile and run the program, but you will find that iLength is equal to 1. Why?

The six characters of the string "Hello!" have the following 16-bit values:

0x0048 0x0065 0x006C 0x006C 0x006F 0x0021 

An Intel processor stores them in memory like this:

48 00 65 00 6C 00 6C 00 6F 00 21 00 

The strlen function, which is trying to find the length of a string of characters, counts the first byte as a character. It then sees that the next byte is 0 and assumes the string ends there.
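The effect can be reproduced on any machine by spelling out the bytes shown above and handing them to strlen. This is a minimal sketch; the array utf16le_hello and the function name are hypothetical, and the two trailing zero bytes are the 16-bit terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* L"Hello!" as 16-bit values stored least significant byte first,
   followed by the 2-byte terminating 0. */
static const unsigned char utf16le_hello[] = {
    0x48, 0x00, 0x65, 0x00, 0x6C, 0x00, 0x6C, 0x00,
    0x6F, 0x00, 0x21, 0x00, 0x00, 0x00
};

/* Treats the wide-character bytes as an 8-bit string, as strlen would. */
size_t narrow_length_of_wide_bytes(void)
{
    return strlen((const char *) utf16le_hello);
}

/* Hypothetical self-check: strlen stops at the second byte. */
void check_strlen_on_wide_bytes(void)
{
    assert(sizeof utf16le_hello == 14);
    assert(narrow_length_of_wide_bytes() == 1);
}
```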

This small exercise clearly shows the difference between the C language itself and the run-time library functions. The compiler interprets the string L"Hello!" as a collection of 16-bit short integers and stores them in a wchar_t array. The compiler also handles array indexing and the sizeof operator, so these work properly. But run-time library functions such as strlen are added only at link time. These functions expect strings made up of single-byte characters, and when they meet a wide-character string they do not behave as you would like.

You may say, "Oh, this is too much trouble! Now every C library function must be rewritten to accept wide characters." In fact, not every C library function needs to be rewritten, only those that take string arguments. And you do not have to do it yourself: it has already been done.

The wide-character version of the strlen function is wcslen ("wide-character string length"), and it is declared both in string.h (where strlen is also declared) and in wchar.h. The strlen function is declared as follows:

size_t __cdecl strlen (const char *) ; 

The wcslen function is declared as follows:

size_t __cdecl wcslen (const wchar_t *) ; 

So now we know that to obtain the length of a wide-character string, we can call

iLength = wcslen (pw) ; 

The function returns 6, the number of characters in the string. Remember that changing a string from 8-bit characters to wide characters does not change its length in characters; only its length in bytes changes.
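The distinction between character length and byte length can be sketched as follows. The helper wide_storage_bytes is a hypothetical name; it uses sizeof (wchar_t) so the result is 14 where wchar_t is 2 bytes and larger where wchar_t is wider.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Bytes needed to store a wide string, including its terminating 0. */
size_t wide_storage_bytes(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}

/* Hypothetical self-check: 6 characters regardless of character width. */
void check_wcslen(void)
{
    const wchar_t *pw = L"Hello!";

    assert(wcslen(pw) == 6);
    assert(wide_storage_bytes(pw) == 7 * sizeof(wchar_t));
}
```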

All the C run-time library functions you are familiar with that take string arguments have wide-character versions. For example, wprintf is the wide-character version of printf. These functions are declared in wchar.h and in the header file in which the normal function is declared.
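As one more illustration of the wide-character family, the sketch below formats a number into a wide buffer with swprintf and compares the result with wcscmp, the wide counterparts of sprintf and strcmp. The function names and the "Count: " text are made up for the example; swprintf is called with its standard C signature, which takes the buffer capacity in wide characters.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Formats "Count: <n>" into a wide-character buffer.
   Returns the number of wide characters written, or a negative value
   if the text does not fit. */
int format_count(wchar_t *buffer, size_t capacity, int count)
{
    return swprintf(buffer, capacity, L"Count: %d", count);
}

/* Hypothetical self-check of the formatted result. */
void check_wide_formatting(void)
{
    wchar_t buffer[32];
    int written = format_count(buffer, 32, 42);

    assert(written == 9);
    assert(wcslen(buffer) == 9);
    assert(wcscmp(buffer, L"Count: 42") == 0);
}
```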

 

Of course, Unicode also has disadvantages. The first and most important is that every string in the program occupies twice as much storage. In addition, you will find that the wide-character run-time library functions are larger than the usual functions. For this reason, you may want to create two versions of a program: one for ASCII strings and the other for Unicode strings. The best solution is to maintain a single source code file that can be compiled for either ASCII or Unicode.

That is a bit of a problem, even for a short program: the run-time library functions have different names, the characters are defined with different types, and string literals need that troublesome L in front.

One answer is to use the tchar.h header file included with Microsoft Visual C++. This header file is not part of the ANSI C standard, so every function and macro defined in it begins with an underscore. tchar.h provides a set of alternative names (for example, _tprintf and _tcslen) for the standard run-time library functions that take string arguments. These are sometimes called "generic" function names because they can refer to either the Unicode or the non-Unicode version of a function.

If an identifier named _UNICODE is defined and the program includes the tchar.h header file, _tcslen is defined as wcslen:

#define _tcslen wcslen 

If _UNICODE is not defined, _tcslen is defined as strlen:

#define _tcslen strlen 

And so on. tchar.h also solves the problem of the two character data types with a new data type named TCHAR. If the _UNICODE identifier is defined, TCHAR is wchar_t:

typedef wchar_t TCHAR ; 

Otherwise, TCHAR is simply char:

typedef char TCHAR ; 

Now for the problem of the L in string literals. If the _UNICODE identifier is defined, a macro called __T is defined as follows:

#define __T(x) L##x 

This is rather obscure syntax, but it complies with the ANSI C standard preprocessor specification. The pair of number signs (##) is called a "token paste"; it joins the letter L to the macro argument. So if the macro argument is "Hello!", L##x becomes L"Hello!".

If the _UNICODE identifier is not defined, the __T macro is simply defined as follows:

#define __T(x) x 

In addition, two other macros are defined to be the same as __T:

#define _T(x) __T(x) 
#define _TEXT(x) __T(x) 
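One commonly cited reason for routing _T and _TEXT through a second macro is that an argument used directly with ## is not macro-expanded first. The sketch below shows the pattern with hypothetical names (WIDEN_, WIDEN, GREETING) rather than the tchar.h macros themselves.

```c
#include <assert.h>
#include <wchar.h>

/* Inner macro: pastes L onto its argument exactly as written. */
#define WIDEN_(x) L##x

/* Outer macro: its argument is expanded before WIDEN_ sees it. */
#define WIDEN(x) WIDEN_(x)

/* A string literal hidden behind another macro. */
#define GREETING "Hello!"

/* WIDEN(GREETING) becomes WIDEN_("Hello!") and then L"Hello!".
   WIDEN_(GREETING) would instead paste to the identifier LGREETING,
   which is not defined. */
const wchar_t *wide_greeting(void)
{
    return WIDEN(GREETING);
}

/* Hypothetical self-check of the expansion. */
void check_two_level_paste(void)
{
    assert(wcscmp(wide_greeting(), L"Hello!") == 0);
}
```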

Which one you use in a Win32 console program depends on whether you prefer to be concise or verbose. Basically, you define string literals inside the _T or _TEXT macro as follows:

_TEXT ("Hello!") 

With this, the string is interpreted as a wide-character string if _UNICODE is defined, and as an 8-bit character string otherwise.
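Because tchar.h is specific to Microsoft's compiler, the following sketch imitates its mechanism with stand-in names so that it compiles anywhere. DEMO_UNICODE, demo_tchar, demo_tcslen, and DEMO_T are hypothetical substitutes for _UNICODE, TCHAR, _tcslen, and _T; they are not the real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-in for _UNICODE. Remove this line to build the 8-bit version. */
#define DEMO_UNICODE

#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;          /* plays the role of TCHAR   */
#define demo_tcslen wcslen           /* plays the role of _tcslen */
#define DEMO_T_(x) L##x              /* plays the role of __T     */
#else
typedef char demo_tchar;
#define demo_tcslen strlen
#define DEMO_T_(x) x
#endif

#define DEMO_T(x) DEMO_T_(x)         /* plays the role of _T      */

/* The same source line works for both builds. */
size_t greeting_length(void)
{
    const demo_tchar *greeting = DEMO_T("Hello!");

    return demo_tcslen(greeting);
}

/* Size in bytes of the character type selected by the switch. */
size_t greeting_char_size(void)
{
    return sizeof(demo_tchar);
}

/* Hypothetical self-check: the length is 6 in either build. */
void check_generic_text(void)
{
    assert(greeting_length() == 6);
}
```

Only the switch at the top changes between the two builds; the function bodies stay the same.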

 

Wide Characters and Windows

Windows NT supports Unicode from the ground up. This means that Windows NT internally uses character strings composed of 16-bit characters. Because much of the rest of the world does not use 16-bit character strings, Windows NT must often convert strings on their way into and out of the operating system. Windows NT can run programs written for ASCII, for Unicode, or for a mix of the two. That is, Windows NT supports different API function calls that accept 8-bit or 16-bit strings (we will see how this works shortly).
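To give a feel for what converting an 8-bit string to a wide string involves, here is a sketch that uses the standard C function mbstowcs. It does not show how Windows NT performs its internal conversions; the function widen_ascii is a hypothetical helper, and the example assumes plain ASCII input in the default "C" locale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Converts an 8-bit string to wide characters.
   capacity is the size of dest in wide characters, terminator included.
   Returns the number of wide characters stored, not counting the 0. */
size_t widen_ascii(wchar_t *dest, size_t capacity, const char *src)
{
    size_t converted;

    if (capacity == 0)
        return 0;

    /* Leave room for the terminating 0, which mbstowcs does not add
       when the output is truncated. */
    converted = mbstowcs(dest, src, capacity - 1);
    if (converted == (size_t) -1)
        converted = 0;               /* invalid input: empty result */

    dest[converted] = L'\0';
    return converted;
}

/* Hypothetical self-check with the string used throughout the text. */
void check_widen_ascii(void)
{
    wchar_t wide[16];

    assert(widen_ascii(wide, 16, "Hello!") == 6);
    assert(wcscmp(wide, L"Hello!") == 0);
}
```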

Windows 98 has much less Unicode support than Windows NT. Only a few Windows 98 function calls support wide-character strings (these functions are listed in Microsoft Knowledge Base article Q125671; they include MessageBox). If you want to distribute a single .EXE file that must run under both Windows NT and Windows 98, it should not use Unicode, or it will not run under Windows 98; in particular, the program must not call the Unicode versions of the Windows functions. Because you will be in a better position to release a Unicode version of your program in the future, you will probably want to write source code that can be compiled for either ASCII or Unicode. That is how all the programs in this book are written.

 

Windows Header File Types

As you saw in Chapter 1, a Windows program includes the header file windows.h. That file includes a number of other header files, among them windef.h, which contains many of the basic type definitions used in Windows and which itself includes winnt.h. winnt.h handles the basic Unicode support.

Near the top, winnt.h includes the C header file ctype.h, one of many C header files that contain a definition of wchar_t. winnt.h defines new data types named CHAR and WCHAR:

typedef char CHAR ; 
typedef wchar_t WCHAR ; // wc 

When you need to define 8-bit or 16-bit characters, CHAR and WCHAR are the recommended data types to use in Windows programs. The comment after the WCHAR definition suggests a Hungarian notation prefix: a variable based on the WCHAR data type can be prefixed with the letters wc to indicate a wide character.

The winnt.h header file goes on to define six data types that can be used as pointers to 8-bit character strings and four data types that can be used as pointers to const 8-bit character strings. Here is a selection of the useful data type statements from the header file:

typedef CHAR * PCHAR, * LPCH, * PCH, * NPSTR, * LPSTR, * PSTR ; 
typedef CONST CHAR * LPCCH, * PCCH, * LPCSTR, * PCSTR ; 

The prefixes N and L stand for "near" and "long" and refer to the two different pointer sizes in 16-bit Windows. In Win32 there is no difference between near and long pointers.

Similarly, winnt.h defines six data types that can be used as pointers to 16-bit character strings and four data types that can be used as pointers to const 16-bit character strings:

typedef WCHAR * PWCHAR, * LPWCH, * PWCH, * NWPSTR, * LPWSTR, * PWSTR ; 
typedef CONST WCHAR * LPCWCH, * PCWCH, * LPCWSTR, * PCWSTR ; 

So far, we have the data types CHAR (an 8-bit char) and WCHAR (a 16-bit wchar_t), as well as pointers to CHAR and WCHAR. Like tchar.h, winnt.h defines TCHAR as a generic character type. If the identifier UNICODE (without the underscore) is defined, TCHAR and pointers to TCHAR are defined as WCHAR and pointers to WCHAR; if the identifier UNICODE is not defined, TCHAR and pointers to TCHAR are defined as char and pointers to char:

#ifdef UNICODE 
typedef WCHAR TCHAR, * PTCHAR ; 
typedef LPWSTR LPTCH, PTCH, PTSTR, LPTSTR ; 
typedef LPCWSTR LPCTSTR ; 
#else 
typedef char TCHAR, * PTCHAR ; 
typedef LPSTR LPTCH, PTCH, PTSTR, LPTSTR ; 
typedef LPCSTR LPCTSTR ; 
#endif 

Both the winnt.h and wchar.h header files are protected against redefining the TCHAR data type if it has already been defined in one of these header files. However, whenever you use other header files in your program, you should include windows.h before all the others.

The winnt.h header file also defines a macro that adds the L before the first quotation mark of a string. If the UNICODE identifier is defined, a macro called __TEXT is defined as follows:

#define __TEXT(quote) L##quote 

If the identifier UNICODE is not defined, the __TEXT macro is defined as follows:

#define __TEXT(quote) quote 

In either case, the TEXT macro is defined as follows:

#define TEXT(quote) __TEXT(quote) 

These macros work the same way as the _TEXT macro defined in tchar.h, but you don't need to bother with the underscore. I will use the TEXT version of this macro in this book.

These definitions let you mix ASCII and Unicode strings in the same program, or write a single program that can be compiled for either ASCII or Unicode. To explicitly define 8-bit character variables and strings, use CHAR, PCHAR (or one of the other pointer types), and strings in plain quotation marks. To explicitly use 16-bit character variables and strings, use WCHAR, PWCHAR, and an L before the quotation marks. For variables and strings that will be 8-bit or 16-bit depending on whether the UNICODE identifier is defined, use TCHAR, PTCHAR, and the TEXT macro.
