Unicode Programming in VC++
Author: Han yaoxu
I. What is Unicode?
Start with ASCII. ASCII is an encoding standard for English text in which each character occupies one byte. A byte can take 256 different values (00H-FFH), but English does not need that many: standard ASCII uses only the first 128 (00H-7FH, highest bit 0), which cover control characters, digits, uppercase and lowercase letters, and other symbols. The other 128 values (80H-FFH, highest bit 1) are called "extended ASCII" and are generally used for box-drawing characters, some phonetic symbols, and other symbols.
This encoding was clearly designed for English. For scripts such as Chinese and Arabic, 256 characters are obviously not enough.
As a result, individual countries developed their own text encoding standards. The Chinese standard is GB2312-80. It is compatible with standard ASCII and, since extended ASCII was never really standardized, it reuses that range: each Chinese character is represented by two bytes in the extended ASCII range (highest bit 1), which distinguishes it from the ASCII part.
However, this approach has problems. The biggest is that the Chinese encoding overlaps with extended ASCII. Much software uses the box-drawing characters of extended ASCII to draw tables. When such software runs on a Chinese system, the table characters are mistaken for Chinese characters and appear as garbled text.
In addition, because every country and region has its own character encoding rules and these rules conflict with one another, exchanging information between countries and regions is troublesome.
To really solve the problem, we cannot keep extending ASCII. We need a brand-new encoding system that encodes Chinese, French, German, and every other language uniformly.
So Unicode was born.
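Unicode is also a character encoding. In the 16-bit form used by Windows, each character occupies two bytes (0000H-FFFFH), which gives 65,536 code values and covers the commonly used characters of the world's languages. (Later versions of the standard go beyond this range; in UTF-16 such characters are written as a pair of 16-bit units.)
In Unicode, all characters are treated equally. A Chinese character is no longer "two extended ASCII characters" but "one Unicode character". In other words, text in every language is processed one character at a time, and every character has a unique Unicode code.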
II. Benefits of using Unicode encoding
Unicode encoding allows your project to support multiple languages at the same time, so that your project can be internationalized.
In addition, Windows NT was developed using Unicode, and the entire system is Unicode-based. If you call an API function and pass it an ANSI string (ASCII and the compatible character sets derived from it, such as GB2312, are usually called ANSI character sets), the system must first convert the string to Unicode and then pass the Unicode string to the operating system. If the function is expected to return an ANSI string, the system converts the Unicode string to ANSI before returning the result to your application. These conversions cost time and memory, so an application developed with Unicode runs more efficiently.
The following table illustrates the differences between ANSI and Unicode:
Character    | A     | N     | 和
ANSI code    | 41H   | 4EH   | BACDH
Unicode code | 0041H | 004EH | 548CH
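The comparison in the table above can be checked with a few lines of code. This is only a minimal sketch and not part of the article's source code: it assumes a C++11 compiler (newer than VC++ 6.0) so that char16_t can stand in for a 16-bit Unicode character, it writes the GB2312 bytes of the Chinese character (BA CD) as escapes, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Bytes needed for one character stored in an ANSI code page string.
std::size_t ansi_bytes(const char *one_character) {
    return std::strlen(one_character);
}

// Bytes needed for one 16-bit Unicode character: the same for every character.
std::size_t unicode_bytes(char16_t) {
    return sizeof(char16_t);
}

void check_table() {
    assert(ansi_bytes("A") == 1);                // 41H
    assert(ansi_bytes("\xBA\xCD") == 2);         // GB2312 bytes of the Chinese character
    assert(u'A' == 0x0041 && u'N' == 0x004E);    // ASCII values, widened to 16 bits
    assert(u'\u548C' == 0x548C);                 // the same Chinese character in Unicode
    assert(unicode_bytes(u'A') == unicode_bytes(u'\u548C'));
}
```

In the ANSI column the storage size depends on the character, while in the Unicode column every character in the table has the same width.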
III. Unicode programming using C++
Support for wide characters is actually part of the ANSI C standard, where it is used to represent characters that need more than one byte. Wide characters are not exactly the same thing as Unicode: Unicode is only one kind of wide-character encoding.
1. Definition of wide characters
In ANSI, a character (char) is one byte long. With Unicode, a character occupies one word (two bytes). In VC++ 6.0, the most basic wide character type, wchar_t, is defined in the wchar.h header file:
typedef unsigned short wchar_t;
Here we can clearly see that the so-called wide character is an unsigned short integer.
2. Wide string constants
For C++ programmers, writing string constants is a routine task. So how do we write a wide string constant? Simply put an uppercase L in front of the string constant, for example:
const wchar_t *str1 = L"Hello";
This L is very important: only with it does the compiler know that you want the string stored as wide characters. Note that there must be no space between the L and the string.
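The effect of the L prefix can be seen by comparing the storage size of a narrow and a wide literal. The following is a minimal sketch with hypothetical names; it relies only on the fact that a wide literal is an array of wchar_t, whose size is 2 bytes with Visual C++ and may differ with other compilers.

```cpp
#include <cassert>
#include <cstddef>

const char    *narrow_text = "Hello";   // array of char
const wchar_t *wide_text   = L"Hello";  // array of wchar_t, because of the L prefix

// Both literals hold 5 characters plus a terminator; only the element size differs.
std::size_t narrow_literal_bytes() { return sizeof("Hello"); }
std::size_t wide_literal_bytes()   { return sizeof(L"Hello"); }

void check_literals() {
    assert(narrow_literal_bytes() == 6);
    assert(wide_literal_bytes() == 6 * sizeof(wchar_t));
    assert(narrow_text[0] == 'H' && wide_text[0] == L'H');
}
```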
3. Wide string library functions
C++ defines a dedicated set of functions for operating on wide strings. For example, the function that returns the length of a wide string is
size_t __cdecl wcslen(const wchar_t *);
Why do we need a separate set of functions? The most fundamental reason is that every ANSI string ends with a '\0' byte (a Unicode string ends with the two bytes "\0\0"), and many string functions depend on this to work correctly. A wide character, however, occupies two bytes in memory, so the functions that operate on ANSI strings cannot process it correctly. Take the string "Hello" as an example. It contains the following five characters:
0x0048 0x0065 0x006c 0x006c 0x006f
In the memory, the actual arrangement is:
48 00 65 00 6c 00 6c 00 6f 00
Therefore, when an ANSI string function such as strlen encounters the first 00 byte after 48, it assumes that the string has ended. The result of applying strlen to this wide string is always 1!
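This behaviour is easy to reproduce. The sketch below uses hypothetical helper names and assumes a little-endian machine such as x86, where the low byte of each wide character is stored first, as in the memory layout shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Length reported by the ANSI function when it is (incorrectly) pointed at
// wide-character data: it stops at the first zero byte.
std::size_t length_seen_by_strlen(const wchar_t *wide) {
    assert(wide != nullptr);
    return std::strlen(reinterpret_cast<const char *>(wide));
}

// Length reported by the wide function, which stops at a zero wchar_t.
std::size_t length_seen_by_wcslen(const wchar_t *wide) {
    assert(wide != nullptr);
    return std::wcslen(wide);
}
```

For L"Hello", the first helper stops right after the byte 48 and the second counts all five characters.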
4. Macro-based programming for ANSI and Unicode
As we have seen, C++ has a complete set of data types and functions for Unicode programming, so C++ can be used directly for Unicode programming.
Suppose we want our program to exist in two versions, ANSI and Unicode. Writing two sets of code is feasible, but maintaining both is very troublesome. To reduce this burden, the C++ runtime library defines a series of macros that support generic programming for ANSI and Unicode.
These macros work by testing whether _UNICODE (note the leading underscore) is defined, and expand to the ANSI or the Unicode form of a character or string accordingly.
Part of the tchar.h header file is excerpted below:
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define __T(x) L##x
#else
typedef char TCHAR;
#define __T(x) x
#endif
#define _T(x) __T(x)
As the excerpt shows, these macros expand to ANSI or Unicode characters depending on whether _UNICODE is defined. The macros defined in the tchar.h header file fall into two categories:
A. Macros that define characters and string constants. Only the two most common are listed:
Macro | _UNICODE not defined (ANSI characters) | _UNICODE defined (Unicode characters)
TCHAR | char                                   | wchar_t
_T(x) | x                                      | L##x
Note:
"##" is the ANSI C preprocessor's token-pasting operator. Here it attaches the L prefix to the macro argument. That is, if we write _T("hello"), it expands to L"hello".
B. Macros for calling string functions
C++ also defines a series of macros for the string functions. Again, only a few commonly used ones are listed:
Macro   | _UNICODE not defined (ANSI characters) | _UNICODE defined (Unicode characters)
_tcschr | strchr                                 | wcschr
_tcscmp | strcmp                                 | wcscmp
_tcslen | strlen                                 | wcslen
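The two categories of macros can be imitated in portable code to see how they work together. This is a sketch under stated assumptions, not the real header: MY_UNICODE, my_tchar, MY_T and the my_tcs* names are hypothetical stand-ins for _UNICODE, TCHAR, _T and the _tcs* macros, so that the example compiles without tchar.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

#define MY_UNICODE              // hypothetical switch; remove it for the ANSI variant

#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T_(x)  L##x          // ## pastes the L prefix onto the literal
#define my_tcslen std::wcslen
#define my_tcscmp std::wcscmp
#define my_tcschr std::wcschr
#else
typedef char my_tchar;
#define MY_T_(x)  x
#define my_tcslen std::strlen
#define my_tcscmp std::strcmp
#define my_tcschr std::strchr
#endif
#define MY_T(x) MY_T_(x)        // second level, so a macro argument is expanded first

// Written once against the generic names; compiles for either character type.
bool has_extension(const my_tchar *file_name) {
    assert(file_name != nullptr);
    return my_tcschr(file_name, MY_T('.')) != nullptr;
}

std::size_t name_length(const my_tchar *file_name) {
    assert(file_name != nullptr);
    return my_tcslen(file_name);
}

bool same_name(const my_tchar *a, const my_tchar *b) {
    return my_tcscmp(a, b) == 0;
}
```

Only the block between #ifdef and #endif knows about the character type; the three functions below it never mention char or wchar_t directly.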
IV. Unicode programming using the Win32 API
The Win32 API defines its own character data types in the winnt.h header file. For example:
typedef char CHAR;
typedef unsigned short WCHAR;    // wc, 16-bit UNICODE character
typedef CONST CHAR *LPCSTR, *PCSTR;
For generic ANSI/Unicode programming, winnt.h also defines macros and types for characters and string constants. Again, only a few of the most common are shown:
#ifdef UNICODE
typedef WCHAR TCHAR, *PTCHAR;
typedef LPWSTR LPTCH, PTCH;
typedef LPWSTR PTSTR, LPTSTR;
typedef LPCWSTR LPCTSTR;
#define __TEXT(quote) L##quote      // r_winnt
#else   /* UNICODE */               // r_winnt
typedef char TCHAR, *PTCHAR;
typedef LPSTR LPTCH, PTCH;
typedef LPSTR PTSTR, LPTSTR;
typedef LPCSTR LPCTSTR;
#define __TEXT(quote) quote         // r_winnt
#endif /* UNICODE */                // r_winnt
From the header file above, we can see that winnt.h compiles conditionally depending on whether UNICODE (no leading underscore) is defined.
The Win32 API also defines its own set of string functions, such as lstrlen, which expand to the ANSI or Unicode version depending on whether UNICODE is defined. These API functions and the C++ string functions do the same job, so we recommend using the C++ string functions where possible; there is no need to spend much effort learning the API versions.
You may never have noticed that Win32 API functions that take strings actually come in two versions: one accepts MBCS strings and the other accepts Unicode strings. For example, there is actually no API function named SetWindowText(). Instead, there are SetWindowTextA() and SetWindowTextW(). The suffix A indicates the MBCS version and the suffix W indicates the Unicode version. These API functions are declared in winuser.h. The relevant part of the SetWindowText() declaration in winuser.h is shown below:
#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif // !UNICODE
As you can see, whether the generic name refers to the Unicode or the MBCS version of the function depends on whether UNICODE is defined.
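The same A/W dispatch pattern can be tried outside Windows. In the sketch below everything is a hypothetical stand-in: SetTitleA and SetTitleW play the role of an A/W function pair, MY_UNICODE plays the role of UNICODE, and MY_TEXT plays the role of __TEXT. The functions only record which version was called.

```cpp
#include <cassert>
#include <string>

std::string last_call;   // name of the version that ran most recently

// Hypothetical A/W pair: same job, different character type.
void SetTitleA(const char *)    { last_call = "SetTitleA"; }
void SetTitleW(const wchar_t *) { last_call = "SetTitleW"; }

#define MY_UNICODE       // hypothetical switch; remove it to select the A version

#ifdef MY_UNICODE
#define SetTitle   SetTitleW
#define MY_TEXT(x) L##x
#else
#define SetTitle   SetTitleA
#define MY_TEXT(x) x
#endif

// Application code uses only the generic name; the preprocessor picks the function.
void set_default_title() {
    SetTitle(MY_TEXT("Untitled"));
    assert(!last_call.empty());
}
```

Because the generic name is only a macro, no function called SetTitle exists in the compiled program, which mirrors the situation described for SetWindowText().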
Careful readers may have noticed the difference between UNICODE and _UNICODE. The former has no leading underscore and is used by the Windows header files. The latter has a leading underscore and is used by the C runtime header files. In other words, the C runtime macros expand to their Unicode or ANSI forms depending on whether _UNICODE (with underscore) is defined, while the Windows macros do so depending on whether UNICODE (without underscore) is defined.
As we will see later, in practice we do not distinguish strictly between the two: we define both _UNICODE and UNICODE to build the Unicode version of a program.
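One way to make sure the two switches never disagree is a small guard at the top of a common header. This is only a sketch of a possible convention, not something the article's project contains; the helper function names are hypothetical.

```cpp
// Keep the two switches in step. This must come before any Windows or
// C runtime header that tests them (for example windows.h or tchar.h).
#if defined(UNICODE) && !defined(_UNICODE)
#define _UNICODE
#elif defined(_UNICODE) && !defined(UNICODE)
#define UNICODE
#endif

#include <cassert>

// Hypothetical helper: the setting the Windows headers would see.
bool windows_headers_use_unicode() {
#ifdef UNICODE
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: the setting the C runtime headers would see.
bool crt_headers_use_unicode() {
#ifdef _UNICODE
    return true;
#else
    return false;
#endif
}

void check_switches_agree() {
    assert(windows_headers_use_unicode() == crt_headers_use_unicode());
}
```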
V. Unicode programming in VC++ 6.0
VC++ 6.0 supports Unicode programming, but the default is ANSI. Developers can nevertheless write Unicode applications easily by changing their coding habits slightly.
Using VC++ 6.0 for Unicode programming mainly involves the following tasks:
1. Add the UNICODE and _UNICODE preprocessor definitions to the project.
Specific steps: open the [Project] -> [Settings...] dialog box, as shown in Figure 1. On the C/C++ tab, in "Preprocessor definitions", remove _MBCS and add _UNICODE and UNICODE (note that definitions are separated by commas). The result is shown in Figure 2:
Figure 1
Figure 2
When UNICODE and _UNICODE are not defined, all functions and types use the ANSI version by default. Once UNICODE and _UNICODE are defined, all MFC classes and Windows APIs switch to their wide-character versions.
2. Set the program entry point
MFC applications have a dedicated entry point for Unicode builds, so we need to set it explicitly. Otherwise, a link error occurs.
To set the entry point, open the [Project] -> [Settings...] dialog box and, on the Link tab, in the Output category, enter wWinMainCRTStartup in the entry point field.
Figure 3
3. Use generic ANSI/Unicode data types
Microsoft provides some generic data types and macros that work with both ANSI and Unicode. The most commonly used are _T, TCHAR, LPTSTR, and LPCTSTR.
By the way, LPCTSTR is exactly the same as const TCHAR *. L stands for "long pointer", a leftover kept for compatibility with 16-bit systems such as Windows 3.1; on Win32 and other 32-bit systems, the long pointer, near pointer, and far modifier exist only for compatibility and have no practical meaning. P (pointer) indicates a pointer, C (const) indicates a constant, T (the _T macro) indicates compatibility with both ANSI and Unicode, and STR (string) indicates that the variable is a string. In summary, LPCTSTR is a pointer to a constant string whose character type changes according to the macro definitions. For example:
const TCHAR *pszText = _T("Hello!");
TCHAR szText[] = _T("I Love You");
LPCTSTR lpszText = _T("Hello everyone!");
String arguments passed to functions should be changed in the same way, for example:
MessageBox(_T("hello"));
In an ANSI build, the preceding statement also compiles without the _T macro. In a Unicode build, however, a narrow literal such as "hello" is not converted automatically, and the call does not compile. We therefore recommend always using the _T macro, which also shows that you are aware of Unicode encoding.
4. Modify string operations
Some string functions need the number of characters in a buffer (sizeof(szBuffer) / sizeof(TCHAR)), while others need the number of bytes (sizeof(szBuffer)). Pay attention to this difference and check each string function call to make sure it receives the right value.
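The difference between the two counts matters most when a buffer size is passed to a function. The following minimal sketch uses wchar_t directly instead of TCHAR so that it compiles without tchar.h; copy_wide and demo_copy are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Copies at most capacity_chars - 1 characters and always terminates dest.
// capacity_chars is a character count, not a byte count.
void copy_wide(wchar_t *dest, std::size_t capacity_chars, const wchar_t *src) {
    assert(capacity_chars > 0);
    std::wcsncpy(dest, src, capacity_chars - 1);
    dest[capacity_chars - 1] = L'\0';
}

std::size_t demo_copy() {
    wchar_t szBuffer[8];
    // Correct: the number of elements. Passing sizeof(szBuffer) instead would
    // claim room for sizeof(wchar_t) times more characters than the buffer has.
    copy_wide(szBuffer, sizeof(szBuffer) / sizeof(szBuffer[0]), L"Hello, Unicode");
    return std::wcslen(szBuffer);   // the text was truncated to fit the buffer
}
```

In an ANSI build the two expressions give the same number, which is why this kind of mistake often shows up only after switching to Unicode.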
ANSI functions start with str, such as strcpy(), strcat(), and strlen();
Unicode functions start with wcs, such as wcscpy(), wcscat(), and wcslen();
Generic ANSI/Unicode functions start with _tcs, such as _tcscpy() (C runtime library);
Generic ANSI/Unicode functions start with lstr, such as lstrcpy() (Windows functions);
For compatibility with both ANSI and Unicode, we should use the generic string functions that start with _tcs or lstr.
VI. Unicode programming example
Step 1:
Open VC++ 6.0 and create a dialog-based project named Unicode. Add a button control to the main dialog box IDD_UNICODE_DIALOG, double-click the control, and add its handler function:
void CUnicodeDlg::OnButton1()
{
    TCHAR *str1 = _T("ANSI and UNICODE encoding test");
    m_disp = str1;
    UpdateData(FALSE);
}
Add the static text control IDC_DISP and use ClassWizard to attach a CString member variable m_disp to it. Compile the project in the ANSI configuration to generate unicode.exe.
Step 2:
Open the Control Panel and click "Date, Time, Language, and Regional Options". In that window, click "Regional and Language Options" to open the dialog box of the same name. On the "Advanced" tab, change "Language for non-Unicode programs" to "Japanese" and click the "Apply" button:
Figure 4
In the dialog box that appears, click "Yes", then restart the computer for the setting to take effect.
Run unicode.exe and click "Button1". The text in the static text box is garbled.
Step 3:
Switch to the Unicode configuration, recompile the project to generate unicode.exe, run it again, and click "Button1". The text is now displayed correctly, which shows the advantage of Unicode encoding.
That is all for now. Good luck.