Reposted from http://www.cnblogs.com/kex1n/archive/2010/03/15/2286510.html
Original source http://www.vckbase.com/document/viewdoc/?id=1733
I. What is Unicode
Let's start with ASCII. ASCII is an encoding standard for English characters. Each ASCII character occupies 1 byte, so a byte can represent at most 256 distinct characters (00H-FFH). English does not need that many; generally only the first 128 (00H-7FH, highest bit 0) are used, which cover control characters, digits, uppercase and lowercase letters, and some other symbols. The other 128 values (80H-FFH, highest bit 1) are referred to as "extended ASCII" and are commonly used for box-drawing characters, some phonetic characters, and other symbols.
This encoding is clearly fine for English. But for Chinese, Arabic, and other languages with large or complex scripts, 256 characters is not enough.
Each country therefore developed its own text encoding standard. The Chinese one is called "GB2312-80". It is compatible with ASCII, and it takes advantage of the fact that extended ASCII was never truly standardized: a Chinese character is represented by two bytes in the extended ASCII range, which distinguishes it from the ASCII portion.
This approach has problems, the biggest being that the Chinese encoding overlaps with extended ASCII. Many programs use the extended ASCII box-drawing characters to draw tables; when such programs run on a Chinese system, those table lines are mistaken for Chinese characters and appear garbled.
In addition, because each country and region has its own encoding rules and these conflict with one another, exchanging information between countries and regions becomes very troublesome.
Truly solving this problem cannot be done by extending ASCII. It requires a new encoding system that considers Chinese, French, German, and all other scripts together and assigns each character its own code.
So, Unicode was born.
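Unicode is also a character encoding. As originally designed, and as used in this article, each character occupies two bytes (0000H-FFFFH), giving 65,536 codes, which covers the characters in common use across the world's languages. (Later versions of the standard extend beyond this range; Windows represents those additional characters as pairs of 16-bit units.)
In Unicode, all characters are treated equally. A Chinese character is no longer "two extended ASCII characters" but "one Unicode character"; that is, every piece of text is processed one character at a time, and every character has a unique Unicode code.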
II. Benefits of using Unicode encoding
Using Unicode lets your project support multiple languages at the same time, which makes it possible to internationalize it.
In addition, Windows NT was developed using Unicode, and the entire system is Unicode-based. If you call an API function and pass it an ANSI string (the ASCII character set and its derived, compatible character sets such as GB2312 are commonly called the ANSI character set), the system first converts the string to Unicode and then passes the Unicode string on to the operating system. If the function is expected to return an ANSI string, the system first converts the Unicode string to ANSI and then returns the result to your application. These conversions cost time and memory. If you develop your application in Unicode, it can run more efficiently.
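To make the hidden conversion step concrete, here is a minimal, portable sketch of converting a narrow string to a wide string with the standard library. It is only an illustration, not what Windows does internally: the helper name widen is hypothetical, and the conversion follows the current C locale (the default "C" locale is enough for plain ASCII text).

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: convert a narrow (multibyte) string to a wide string
// according to the current C locale. Returns an empty string on failure.
inline std::wstring widen(const char *narrow) {
    std::mbstate_t state = std::mbstate_t();
    const char *src = narrow;

    // First pass: a null destination only counts the wide characters needed.
    std::size_t needed = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1)) {
        return std::wstring();  // invalid multibyte sequence
    }

    // Second pass: allocate the result and perform the conversion.
    std::wstring wide(needed, L'\0');
    state = std::mbstate_t();
    src = narrow;
    std::mbsrtowcs(&wide[0], &src, needed, &state);
    return wide;
}
```

Every call pays for two passes over the text plus an allocation, which is the kind of overhead an ANSI application incurs each time it crosses into a Unicode-based system.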
The following table shows the codes of several characters to illustrate the difference between ANSI and Unicode:
Character | A | N | 和
ANSI code | 41H | 4EH | BACDH
Unicode code | 0041H | 004EH | 548CH
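The Unicode row of the table can be checked directly in C++. The following is a minimal sketch, assuming a compiler whose wide character set is Unicode (true of VC++ and, by default, of GCC and Clang); the function names are illustrative only.

```cpp
#include <cassert>

// Code of the narrow character 'A' (one byte).
inline unsigned narrow_code_of_A() {
    return static_cast<unsigned char>('A');
}

// Code of the wide character L'A': same value, wider storage.
inline unsigned long wide_code_of_A() {
    return static_cast<unsigned long>(L'A');
}

// Code of the wide character U+548C, written as a universal character name
// so that the source file itself can stay pure ASCII.
inline unsigned long wide_code_of_U548C() {
    return static_cast<unsigned long>(L'\u548C');
}
```

The ANSI value of the Chinese character is not checked here because it depends on the code page in effect, which is exactly the problem Unicode removes.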
III. Unicode programming using C++
Support for wide characters is actually part of the ANSI C standard, which provides them so that one character can be represented by more than one byte. Wide characters and Unicode are not exactly the same thing; Unicode is just one encoding that wide characters can hold.
1. Definition of wide characters
In ANSI, one character (char) is one byte long. With Unicode, one character occupies one word (two bytes), and the wchar.h header file in VC++ 6.0 defines the basic wide character type wchar_t as:
typedef unsigned short wchar_t;
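From this we can see that in this environment a wide character is simply an unsigned short integer. (In standard C++, wchar_t is a distinct built-in type whose size is implementation-defined: 2 bytes on Windows, typically 4 bytes on Linux.)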
2. Wide string constants
For C++ programmers, constructing string constants is a routine task. So how do you construct a wide string constant? It's simple: just put a capital L in front of the string constant, like this:
const wchar_t *str1 = L"Hello";
This L is very important. Only with it does the compiler know that you want each character of the string stored in one word. Also note that there must be no space between the L and the string.
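A short sketch of what the L prefix changes in practice. The names kWideHello, wide_hello_length and wide_hello_bytes are made up for this illustration.

```cpp
#include <cstddef>
#include <cwchar>

// The L prefix makes the compiler store every character as a wchar_t.
static const wchar_t *const kWideHello = L"Hello";

// Number of wide characters before the terminating L'\0'.
inline std::size_t wide_hello_length() {
    return std::wcslen(kWideHello);
}

// Storage occupied by the literal in bytes, terminator included:
// six wchar_t elements, whatever size wchar_t has on this platform.
inline std::size_t wide_hello_bytes() {
    return sizeof(L"Hello");
}
```

With a 2-byte wchar_t (Windows) the literal occupies 12 bytes; with a 4-byte wchar_t it occupies 24. The character count is 5 in both cases.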
3. Wide string library functions
To manipulate wide strings, C++ defines a dedicated set of functions, such as this one, which returns the length of a wide string:
size_t __cdecl wcslen(const wchar_t*);
Why are these functions defined separately? The most fundamental reason is that an ANSI string marks its end with a single '\0' byte (a Unicode string ends with two zero bytes, "\0\0"), and many string functions depend on this to work correctly. In the wide-character case each character occupies one word in memory, so the ANSI string functions no longer operate correctly. Take the string "Hello" as an example; as wide characters, its five characters are:
0x0048 0x0065 0x006c 0x006c 0x006f
In memory, the actual arrangement is:
48 00 65 00 6c 00 6c 00 6f 00
An ANSI string function such as strlen therefore treats the 00 that follows the first 48 as the end of the string, so using strlen to measure this wide string (or any wide string that begins with an ASCII character) returns 1!
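The mismatch can be reproduced with a few lines of code. This is a sketch under one stated assumption: the machine is little-endian (as PCs are), so the low byte 48 of the first character comes first in memory. The function names are illustrative.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// What a byte-oriented function sees: it stops at the first zero BYTE,
// which is the high part of the very first wide character.
inline std::size_t length_seen_by_strlen(const wchar_t *wide) {
    return std::strlen(reinterpret_cast<const char *>(wide));
}

// What a wide-character function sees: it stops at the first zero wchar_t.
inline std::size_t length_seen_by_wcslen(const wchar_t *wide) {
    return std::wcslen(wide);
}
```

For L"Hello" the first function reports 1 and the second reports 5, regardless of whether wchar_t is 2 or 4 bytes wide.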
4. Using macros to write code common to ANSI and Unicode
As you can see, C++ has a complete set of data types and functions for Unicode programming; that is, you can use C++ for Unicode programming.
Suppose we want our program to exist in two versions, an ANSI version and a Unicode version. Writing two sets of code would certainly work, but maintaining one set for ANSI characters and another for Unicode characters is very tedious. To ease this burden, C++ defines a series of macros that let a single set of source code serve both ANSI and Unicode.
The essence of these macros is that they expand to ANSI or Unicode characters (strings) depending on whether "_UNICODE" (note the leading underscore) is defined.
Here is an excerpt from the tchar.h header file:
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define __T(x) L##x
#define _T(x) __T(x)
#else
#define __T(x) x
typedef char TCHAR;
#endif
As you can see, these macros expand to ANSI or Unicode characters depending on whether "_UNICODE" is defined. The macros defined in tchar.h fall into two categories:
A. Macros for character types and constant strings. We list only the two most commonly used:
Macro | _UNICODE not defined (ANSI character) | _UNICODE defined (Unicode character)
TCHAR | char | wchar_t
_T(x) | x | L##x
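The two rows of the table can be imitated in a few portable lines. This is a sketch only: DEMO_UNICODE, DEMO_TCHAR, DEMO_T and DEMO_T_IMPL are hypothetical stand-ins for _UNICODE, TCHAR, _T and __T, chosen so that the example does not depend on tchar.h.

```cpp
#include <cstddef>

// Hypothetical switch standing in for _UNICODE.
// Comment this line out to get the narrow (ANSI) build of the same code.
#define DEMO_UNICODE

#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;
#define DEMO_T_IMPL(x) L##x   // paste an L in front of the literal
#else
typedef char DEMO_TCHAR;
#define DEMO_T_IMPL(x) x      // leave the literal unchanged
#endif

// Second macro level, so that an argument that is itself a macro
// is expanded before the paste takes place.
#define DEMO_T(x) DEMO_T_IMPL(x)

// One line of source text, two possible character types.
static const DEMO_TCHAR kGreeting[] = DEMO_T("Hello");

// Size in bytes of one character of the greeting.
inline std::size_t greeting_char_size() {
    return sizeof(kGreeting[0]);
}
```

With the switch defined, kGreeting is an array of six wchar_t; without it, the same declaration produces six char.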
Note:
"##" is preprocessor syntax from the ANSI C standard, known as the token-pasting operator; here it pastes the preceding L onto the macro argument. That is, if we write _T("Hello"), it becomes L"Hello" after expansion.
B. Macros that implement string function calls
C++ also defines a series of macros for the string functions. Again, we list only a few common ones:
Macro | _UNICODE not defined (ANSI character) | _UNICODE defined (Unicode character)
_tcschr | strchr | wcschr
_tcscmp | strcmp | wcscmp
_tcslen | strlen | wcslen
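The function mapping in the table can be sketched the same way. The demo_ names below are hypothetical stand-ins for the _tcs macros, defined locally so the example builds without tchar.h.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical switch standing in for _UNICODE.
#define DEMO_UNICODE

#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(x) L##x
#define demo_tcslen std::wcslen
#define demo_tcscmp std::wcscmp
#define demo_tcschr std::wcschr
#else
typedef char demo_tchar;
#define DEMO_T(x) x
#define demo_tcslen std::strlen
#define demo_tcscmp std::strcmp
#define demo_tcschr std::strchr
#endif

// The functions below are written once and compile in either mode.
inline std::size_t name_length(const demo_tchar *name) {
    return demo_tcslen(name);
}

inline bool same_name(const demo_tchar *a, const demo_tchar *b) {
    return demo_tcscmp(a, b) == 0;
}

inline bool contains_char(const demo_tchar *text, demo_tchar ch) {
    return demo_tcschr(text, ch) != nullptr;
}
```

Calling code uses only the generic names, for example name_length(DEMO_T("Hello")), and never mentions strlen or wcslen directly.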
IV. Unicode programming using the Win32 API
The Win32 API defines some character data types of its own, in the WinNT.h header file. For example:
typedef char CHAR;
typedef unsigned short WCHAR; // wc, 16-bit UNICODE character
WinNT.h also defines macros for writing characters and constant strings that are common to ANSI and Unicode. Again, here are just a few of the most commonly used:
#ifdef UNICODE                      // r_winnt
typedef WCHAR TCHAR, *PTCHAR;
typedef LPWSTR LPTCH, PTCH;
typedef LPWSTR PTSTR, LPTSTR;
typedef LPCWSTR LPCTSTR;
#define __TEXT(quote) L##quote      // r_winnt
#else   /* UNICODE */               // r_winnt
typedef char TCHAR, *PTCHAR;
typedef LPSTR LPTCH, PTCH;
typedef LPSTR PTSTR, LPTSTR;
typedef LPCSTR LPCTSTR;
#define __TEXT(quote) quote         // r_winnt
#endif /* UNICODE */                // r_winnt
As the header excerpt shows, WinNT.h is conditionally compiled based on whether UNICODE (no underscore) is defined.
The Win32 API also defines a set of string functions, such as lstrlen, that expand to ANSI or Unicode versions depending on whether "UNICODE" is defined. These API functions and the C++ string functions provide the same functionality, so unless you have a specific need it is recommended that you use the C++ string functions as much as possible; there is no need to spend much effort learning the API versions.
Perhaps you have never noticed that the Win32 API actually comes in two versions. One version accepts MBCS strings and the other accepts Unicode strings. For example, there is in fact no API function called SetWindowText(); instead there are SetWindowTextA() and SetWindowTextW(). The suffix A indicates the MBCS version of the function, and the suffix W indicates the Unicode version. These API functions are declared in Winuser.h; the following excerpt shows how Winuser.h defines SetWindowText():
#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif // !UNICODE
As you can see, the API function name resolves to the Unicode or the MBCS version depending on whether UNICODE is defined.
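The A/W naming pattern is easy to imitate outside Windows. In the sketch below, DemoTextBytesA, DemoTextBytesW, DemoTextBytes, DEMO_TEXT and DEMO_UNICODE are all invented for illustration; they mirror the roles of SetWindowTextA, SetWindowTextW, SetWindowText, __TEXT and UNICODE but are not part of any real API.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical switch standing in for UNICODE (no underscore).
#define DEMO_UNICODE

// Two real functions exist, one per character type...
inline std::size_t DemoTextBytesA(const char *text) {
    return std::strlen(text) * sizeof(char);
}

inline std::size_t DemoTextBytesW(const wchar_t *text) {
    return std::wcslen(text) * sizeof(wchar_t);
}

// ...and the unsuffixed name is only a macro that picks one of them.
#ifdef DEMO_UNICODE
#define DemoTextBytes DemoTextBytesW
#define DEMO_TEXT(quote) L##quote
#else
#define DemoTextBytes DemoTextBytesA
#define DEMO_TEXT(quote) quote
#endif
```

A call such as DemoTextBytes(DEMO_TEXT("Hi")) compiles in both modes, while the suffixed functions remain available for code that needs one specific version.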
The attentive reader may have noticed the difference between UNICODE and _UNICODE: the former has no underscore and is used by the Windows header files, while the latter has a leading underscore and is used by the C run-time header files. In other words, in the ANSI C++ language the macros expand to Unicode or ANSI characters depending on whether _UNICODE (with underscore) is defined, and in Windows the macros expand to Unicode or ANSI characters depending on whether UNICODE (no underscore) is defined.
As we will see below, in practice we do not distinguish strictly between the two: we define both _UNICODE and UNICODE to build a Unicode version of a program.
V. Writing Unicode applications in VC++ 6.0
VC++ 6.0 supports Unicode programming, but the default is ANSI. Developers can write Unicode-enabled applications easily with only a small change in coding habits.
Unicode programming in VC++ 6.0 mainly involves the following tasks:
1. Add the UNICODE and _UNICODE preprocessor definitions to the project.
Specific steps: open the [Project]->[Settings...] dialog box and, on the C/C++ tab, remove _MBCS from the Preprocessor definitions field and add _UNICODE,UNICODE.
When UNICODE and _UNICODE are not defined, all functions and types use the ANSI version by default. Once UNICODE and _UNICODE are defined, all MFC classes and Windows APIs become their wide-character versions.
2. Set the program entry point
MFC applications have a dedicated entry point for Unicode programs, so we must set the entry point explicitly. Otherwise, a link error occurs.
To set the entry point, open the [Project]->[Settings...] dialog box, go to the Link tab, select the Output category, and enter wWinMainCRTStartup in the Entry-point symbol field.
3. Use the ANSI/Unicode generic data types
Microsoft provides a number of generic data types that are compatible with both ANSI and Unicode. The ones we use most often are _T, TCHAR, LPTSTR, and LPCTSTR.
By the way, LPCTSTR and const TCHAR* are exactly the same. L stands for long pointer, kept for compatibility with 16-bit operating systems such as Windows 3.1; in Win32 and other 32-bit operating systems, the distinction between long and near pointers and the far modifier exist only for compatibility and have no practical meaning. P (pointer) indicates a pointer; C (const) indicates a constant; T (the _T macro) indicates ANSI/Unicode compatibility; STR (string) indicates that the variable is a string. Taken together, LPCTSTR is a pointer to a constant string whose character type changes according to the macro definitions. For example:
const TCHAR* szText = _T("Hello!");
TCHAR szBuffer[] = _T("I Love You");
LPCTSTR lpszText = _T("Hello everyone!");
String arguments passed to functions should be wrapped in the same way, for example:
MessageBox(_T("Hello"));
In an ANSI build this statement compiles even without the _T macro, but in a Unicode build a plain "Hello" literal is a char string and does not convert to the wide string that the function expects. I therefore recommend that you always use the _T macro, which also shows that you are writing with Unicode in mind.
4. Fix string arithmetic problems
Some string functions need the number of characters in a buffer (sizeof(szBuffer)/sizeof(TCHAR)), while others need the number of bytes (sizeof(szBuffer)). Be aware of this difference and check each string function call carefully to make sure it is given the right quantity.
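The difference between the two quantities is easiest to see with a wide buffer. The following sketch uses wchar_t directly so that it builds anywhere; BufferSizes, wide_buffer_sizes and copy_limited are names made up for this example.

```cpp
#include <cstddef>
#include <cwchar>

struct BufferSizes {
    std::size_t bytes;  // what sizeof(szBuffer) gives
    std::size_t chars;  // what sizeof(szBuffer) / sizeof(szBuffer[0]) gives
};

// Report both sizes for a 64-character wide buffer.
inline BufferSizes wide_buffer_sizes() {
    wchar_t szBuffer[64];
    BufferSizes sizes;
    sizes.bytes = sizeof(szBuffer);
    sizes.chars = sizeof(szBuffer) / sizeof(szBuffer[0]);
    return sizes;
}

// Copy into an 8-character buffer. wcsncpy takes a limit in CHARACTERS,
// so passing sizeof(szBuffer) here would overrun the buffer.
inline std::size_t copy_limited(const wchar_t *src) {
    wchar_t szBuffer[8];
    const std::size_t capacity = sizeof(szBuffer) / sizeof(szBuffer[0]);
    std::wcsncpy(szBuffer, src, capacity - 1);
    szBuffer[capacity - 1] = L'\0';  // wcsncpy does not always terminate
    return std::wcslen(szBuffer);
}
```

In an ANSI build the two quantities happen to be equal, which is why code that confuses them often only fails after switching to Unicode.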
The ANSI string functions start with str, such as strcpy(), strcat(), strlen();
The Unicode string functions start with wcs, such as wcscpy(), wcscat(), wcslen();
The generic ANSI/Unicode functions of the C run-time library start with _tcs, such as _tcscpy();
The generic ANSI/Unicode functions of Windows start with lstr, such as lstrcpy();
For code that must work in both ANSI and Unicode builds, use the generic string functions that begin with _tcs or lstr.
VI. A Unicode programming example
Step one:
Open VC++ 6.0 and create a new dialog-based project named Unicode. Add a button control to the main dialog box IDD_UNICODE_DIALOG, double-click the control, and add its handler function:
void CUnicodeDlg::OnButton1()
{
    TCHAR* str1 = _T("ANSI and Unicode coding experiments");
    m_disp = str1;
    UpdateData(FALSE);
}
Add a static text control IDC_DISP, and use ClassWizard to attach a CString member variable m_disp to it. Compile the project in the default ANSI environment to generate Unicode.exe.
Step two:
Open the Control Panel and click the Date, Time, Language, and Regional Options category. In that window, click Regional and Language Options to bring up the Regional and Language Options dialog box. In the dialog box, click the Advanced tab, change the language for non-Unicode programs to Japanese, and click the Apply button.
In the dialog box that pops up, click Yes to restart the computer so that the setting takes effect.
Run Unicode.exe and click the "Button1" button; the text in the static text control appears garbled. (This requires the string to contain non-ASCII text such as Chinese characters; plain ASCII text displays the same under every code page.)
Step three:
Recompile the project in the Unicode environment to generate Unicode.exe. Run Unicode.exe again and click the Button1 button. The text now displays correctly, which shows the benefit of Unicode encoding.