Unicode Programming in VC++


This article is from: http://tech.ddvip.com/2007-03/117395585321221.html

I. What is Unicode

Let us start with ASCII. ASCII is an encoding standard for English characters. Each ASCII character occupies 1 byte, so the encoding can represent at most 256 characters (00H-FFH). English does not actually need that many; in general only the first 128 (00H-7FH, highest bit 0) are used, covering control characters, digits, uppercase and lowercase letters, and some other symbols. The other 128 characters (80H-FFH, highest bit 1) are known as "extended ASCII" and are commonly used for box-drawing characters, some phonetic characters, and other symbols.

This encoding is obviously fine for English. But for Chinese, Arabic, and other complex scripts, 256 characters are not enough.

So each country developed its own text encoding standards. The Chinese standard is called "GB2312-80". It is compatible with ASCII and in effect takes advantage of the fact that extended ASCII was never really standardized: a Chinese character is represented by two extended-ASCII bytes, which distinguishes it from the ASCII portion.

But this approach has problems. The biggest is that the Chinese encoding overlaps with extended ASCII. Many programs use the extended-ASCII box-drawing characters to draw tables; when such programs run on a Chinese system, the table lines are mistaken for Chinese characters and appear garbled.

In addition, because each country and region has its own encoding rules and these conflict with one another, exchanging information between countries and regions becomes very troublesome.

To really solve this problem, extending ASCII is not enough. A completely new encoding system is needed, one that considers Chinese, French, German, and all other scripts together and assigns each character its own code.

So, Unicode was born.

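Unicode is also a character encoding. In the 16-bit form discussed here, each character occupies two bytes (0000H-FFFFH), which gives 65,536 code values, enough to cover the characters in everyday use in the world's major written languages. (The Unicode standard has since grown beyond FFFFH; UTF-16 represents those additional characters with pairs of 16-bit units.)

In Unicode all characters are treated alike. A Chinese character is no longer "two extended-ASCII bytes" but "one Unicode character". In other words, every piece of text is processed one character at a time, and every character has a unique Unicode code.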

II. Benefits of using Unicode encoding

Using Unicode lets your project support multiple languages at the same time, so that it can be internationalized.

In addition, Windows NT was developed using Unicode, and the whole system is Unicode-based. If you call an API function and pass it an ANSI string (the ASCII character set together with the character sets derived from and compatible with it, such as GB2312, is commonly called the ANSI character set), the system first converts the string to Unicode and then passes the Unicode string to the operating system. If the function is supposed to return an ANSI string, the system converts the Unicode string to ANSI before returning the result to your application. These conversions cost time and memory. If you develop your application in Unicode, it can run more efficiently.

The following table shows the encoding of a few characters to illustrate briefly the difference between ANSI and Unicode:

Character	A	N	和
ANSI code	41H	4EH	BACDH
Unicode code	0041H	004EH	548CH
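The Unicode values in the table can be checked with a short sketch in portable C++. The constant and function names here are illustrative only and do not come from any header.

```cpp
#include <cassert>

// The Unicode values from the table, written as wide-character literals.
// The \x escape keeps the example independent of the source file encoding.
const wchar_t kLetterA = L'A';        // 0041H
const wchar_t kLetterN = L'N';        // 004EH
const wchar_t kHe      = L'\x548c';   // the character 和, 548CH

// For ASCII characters the Unicode value equals the one-byte ANSI value.
inline bool same_value_as_ansi(char ansi, wchar_t wide) {
    return static_cast<unsigned char>(ansi) == wide;
}
```

For 'A' and 'N' the two encodings agree in value and differ only in width, while 和 has a Unicode value that does not fit in one byte at all.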

III. Unicode programming using C++

Support for wide characters is actually part of the ANSI C standard, which uses them to represent a character with more than one byte. Wide characters and Unicode are not exactly the same thing: Unicode is just one encoding that wide characters can hold.

1. Definition of wide characters

In ANSI, a character (char) is one byte long. With Unicode, a character occupies one word (16 bits), and C++ defines the basic wide character type wchar_t in the wchar.h header file:

typedef unsigned short wchar_t;

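From this we can see clearly that, in VC++ 6.0, a wide character is simply an unsigned short integer. (In standard C++ wchar_t is a built-in type, and its size is not 16 bits on every platform.)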

2. Wide string constants

For C++ programmers, constructing string constants is an everyday task. So how do you construct a wide string constant? It is simple: just put a capital L in front of the string constant, like this:

const wchar_t *str1 = L"Hello";

This L is very important. Only with it does the compiler know that you want each character of the string stored in one word. Also note that there must be no space between the L and the string.
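As a minimal sketch of the difference the L prefix makes, the following portable C++ compares a wide literal with a narrow one. The names kWide, kNarrow, wide_char_count, and wide_byte_count are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// The L prefix makes the compiler store each character as a wchar_t.
const wchar_t *kWide = L"Hello";
// Without the prefix the same text is stored as one-byte chars.
const char *kNarrow = "Hello";

// Number of characters in the wide string (terminator not counted).
inline std::size_t wide_char_count() { return std::wcslen(kWide); }

// Number of bytes those characters occupy. sizeof(wchar_t) is 2 with
// Visual C++ but may differ elsewhere (it is 4 on many Unix compilers).
inline std::size_t wide_byte_count() {
    return std::wcslen(kWide) * sizeof(wchar_t);
}
```

Both strings hold five characters; only the storage per character differs.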

3. Wide string library functions

To operate on wide strings, C++ defines a dedicated set of functions, for example the function that returns the length of a wide string:

size_t __cdecl wcslen(const wchar_t *);

Why define these functions separately? The fundamental reason is that an ANSI string uses '\0' to mark the end of the string (a Unicode string ends with "\0\0"), and many string functions depend on this to work correctly. With wide characters, each character occupies one word in memory, so the string functions written for ANSI characters no longer work correctly. Take the string "Hello" as an example. As wide characters, its five characters are:

0x0048 0x0065 0x006c 0x006c 0x006f

In memory, the actual arrangement is:

48 00 65 00 6c 00 6c 00 6f 00

So when an ANSI string function such as strlen reaches the 00 in the first "48 00", it assumes the string has ended. Using strlen to measure the length of a wide string like this one therefore always gives 1!
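The effect can be reproduced on any platform by spelling out the byte layout by hand. This is only a sketch: the array and function names are invented for the illustration, and the bytes assume the 16-bit little-endian layout shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The bytes of L"Hello" as they sit in memory on Windows
// (16-bit characters, little-endian), plus the two-byte terminator.
const char kHelloBytes[] = {
    0x48, 0x00, 0x65, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00,
    0x00, 0x00
};

// What strlen reports when handed those bytes: it stops at the first 00.
inline std::size_t length_seen_by_strlen() {
    return std::strlen(kHelloBytes);
}

// Counting 16-bit units up to the 00 00 terminator gives the real length.
inline std::size_t length_in_16bit_units() {
    std::size_t n = 0;
    while (kHelloBytes[2 * n] != 0 || kHelloBytes[2 * n + 1] != 0) ++n;
    return n;
}
```

The second function does by hand what wcslen does for a real wide string on Windows.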

4. Using macros for generic ANSI and Unicode programming

As you can see, C++ has a complete set of data types and functions for Unicode programming, which means you can use C++ to write Unicode programs.

Suppose we want our program to come in two versions, an ANSI version and a Unicode version. Writing two sets of code would certainly work, but maintaining them both is very tedious. To lighten the load, C++ defines a series of macros that let you write one set of code for both ANSI and Unicode.

In essence, these macros expand to ANSI or Unicode characters (strings) depending on whether "_UNICODE" (note the leading underscore) is defined.

Here is an excerpt from the tchar.h header file:

#ifdef _UNICODE
typedef wchar_t TCHAR;
#define __T(x) L##x
#define _T(x) __T(x)
#else
#define __T(x) x
typedef char TCHAR;
#endif

As you can see, these macros expand to ANSI or Unicode characters depending on whether "_UNICODE" is defined. The macros defined in tchar.h fall into two categories:

A. Macros for characters and string constants. We list only the two most commonly used:

Macro	_UNICODE not defined (ANSI character)	_UNICODE defined (Unicode character)
TCHAR	char	wchar_t
_T(x)	x	L##x

Note:

"##" is preprocessor syntax from the ANSI C standard, known as the token-pasting operator. Here it attaches the preceding L to the macro argument. That is, if we write _T("Hello"), it becomes L"Hello" after expansion.

B. Macros that implement string function calls

C++ also defines a series of macros for the string functions. Again, we list only a few common ones:

Macro	_UNICODE not defined (ANSI character)	_UNICODE defined (Unicode character)
_tcschr	strchr	wcschr
_tcscmp	strcmp	wcscmp
_tcslen	strlen	wcslen
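To show how such mappings let one piece of code serve both builds, here is a small imitation of the mechanism in portable C++. Everything prefixed with DEMO_ or demo_ is a hypothetical stand-in written for this sketch; none of it is the real content of tchar.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical switch standing in for _UNICODE; remove this line to get
// the ANSI expansion instead.
#define DEMO_UNICODE

#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;            // plays the role of TCHAR
#define DEMO_T_(x)   L##x              // ## pastes L onto the literal
#define demo_tcslen  std::wcslen       // plays the role of _tcslen
#define demo_tcschr  std::wcschr       // plays the role of _tcschr
#else
typedef char demo_tchar;
#define DEMO_T_(x)   x
#define demo_tcslen  std::strlen
#define demo_tcschr  std::strchr
#endif
#define DEMO_T(x)    DEMO_T_(x)        // plays the role of _T(x)

// Written once; compiles against either character type.
inline std::size_t chars_before_comma(const demo_tchar *text) {
    const demo_tchar *comma = demo_tcschr(text, DEMO_T(','));
    return comma ? static_cast<std::size_t>(comma - text) : demo_tcslen(text);
}
```

The extra macro level (DEMO_T calling DEMO_T_) lets an argument that is itself a macro be expanded before the L is pasted on, which is presumably why tchar.h defines _T in terms of __T.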

IV. Unicode programming using the Win32 API

The Win32 API defines some character data types of its own, in the WinNT.h header file. For example:

typedef char CHAR;
typedef unsigned short WCHAR;    // wc,   16-bit UNICODE character
typedef CONST CHAR *LPCSTR, *PCSTR;

In WinNT.h the Win32 API also defines macros that make characters and string constants generic across ANSI and Unicode. Again, here are just a few of the most commonly used:

#ifdef UNICODE
typedef WCHAR TCHAR, *PTCHAR;
typedef LPWSTR LPTCH, PTCH;
typedef LPWSTR PTSTR, LPTSTR;
typedef LPCWSTR LPCTSTR;
#define __TEXT(quote) L##quote      // r_winnt
#else                               // r_winnt
typedef char TCHAR, *PTCHAR;
typedef LPSTR LPTCH, PTCH;
typedef LPSTR PTSTR, LPTSTR;
typedef LPCSTR LPCTSTR;
#define __TEXT(quote) quote         // r_winnt
#endif                              // r_winnt

As the excerpt shows, WinNT.h is conditionally compiled according to whether UNICODE (no underscore) is defined.

The Win32 API also defines a set of string functions that expand to the ANSI or the Unicode version depending on whether "UNICODE" is defined, for example lstrlen. The API string functions and the C++ string functions do the same job, so unless you have a particular need, it is recommended that you use the C++ string functions wherever possible; there is no need to spend much effort learning the API ones.

You may never have noticed that the Win32 API actually comes in two versions. One accepts MBCS strings and the other accepts Unicode strings. For example, there is in fact no API function called SetWindowText(); instead there are SetWindowTextA() and SetWindowTextW(). The suffix A indicates the MBCS function, and the suffix W indicates the Unicode version. These API functions are declared in the winuser.h header file. The following is the part of winuser.h that declares SetWindowText():

#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif // !UNICODE

So whether an API name refers to the Unicode version or the MBCS version is determined by whether UNICODE is defined.
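The A/W naming scheme can be imitated without the Windows headers. In the sketch below, DemoTextLengthA, DemoTextLengthW, DEMO_UNICODE, and DEMO_TEXT are all hypothetical names invented to mirror the pattern; they are not Win32 functions or macros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical API in the Win32 style: one entry point per character type.
inline std::size_t DemoTextLengthA(const char *text)    { return std::strlen(text); }
inline std::size_t DemoTextLengthW(const wchar_t *text) { return std::wcslen(text); }

// Hypothetical switch standing in for UNICODE.
#define DEMO_UNICODE

// The name callers actually write is only a macro.
#ifdef DEMO_UNICODE
#define DemoTextLength DemoTextLengthW
#define DEMO_TEXT(quote) L##quote
#else
#define DemoTextLength DemoTextLengthA
#define DEMO_TEXT(quote) quote
#endif

// Resolves to DemoTextLengthW(L"window title") in this configuration.
inline std::size_t title_length() {
    return DemoTextLength(DEMO_TEXT("window title"));
}
```

Because the generic name is a macro, no function called DemoTextLength ever exists in the compiled program, just as the text says of SetWindowText().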

The attentive reader may have noticed the difference between UNICODE and _UNICODE. The former has no underscore and is used by the Windows header files; the latter has a leading underscore and is used by the C run-time header files. In other words, in the ANSI C++ language the macros expand to Unicode or ANSI characters depending on whether _UNICODE (with underscore) is defined, while in Windows they expand depending on whether UNICODE (no underscore) is defined.

As we will see below, in practice we do not distinguish strictly between the two: we define both _UNICODE and UNICODE to build the Unicode version of a program.

V. Writing Unicode applications in VC++ 6.0

VC++ 6.0 supports Unicode programming, but the default is ANSI. Developers only need to change their coding habits slightly to write Unicode-capable applications with ease.

Unicode programming with VC++ 6.0 mainly involves the following tasks:

1. Add the UNICODE and _UNICODE preprocessor definitions to the project.

Specific steps: open the [Project]->[Settings...] dialog box, as shown in Figure 1. On the C/C++ tab, under preprocessor definitions, remove _MBCS and add _UNICODE,UNICODE (note the comma between them). The result is shown in Figure 2:

Figure 1

Figure 2

When UNICODE and _UNICODE are not defined, all functions and types default to the ANSI versions. Once UNICODE and _UNICODE are defined, all MFC classes and Windows APIs become the wide-character versions.

2. Set the program entry point

MFC applications have a dedicated entry point for Unicode programs, so we need to set the entry point. Otherwise a link error occurs.

To set the entry point, open the [Project]->[Settings...] dialog box and, on the Link tab, in the Output category, enter wWinMainCRTStartup in the entry-point field, as shown in Figure 3.

Figure 3

3. Use the generic ANSI/Unicode data types

Microsoft provides a number of generic data types compatible with both ANSI and Unicode. The ones we use most often are _T, TCHAR, LPTSTR, and LPCTSTR.

By the way, LPCTSTR is exactly the same as const TCHAR*. L stands for a long pointer, a holdover for compatibility with 16-bit operating systems such as Windows 3.1; in Win32 and other 32-bit operating systems, long pointers, near pointers, and the far modifier exist only for compatibility and have no practical meaning. P (pointer) indicates that this is a pointer; C (const) indicates a constant; T (the _T macro) indicates compatibility with both ANSI and Unicode; STR (string) indicates that the variable is a string. In short, LPCTSTR is a pointer to a constant string whose character type changes according to the macro definitions. For example:

const TCHAR* szText = _T("Hello!");
TCHAR szGreeting[] = _T("I Love You");
LPCTSTR lpszText = _T("大家好！");

It is best to make the same change to arguments in function calls, for example:

MessageBox(_T("你好"));

In an ANSI build the statement above compiles even without the _T macro. In a Unicode build, however, the plain literal "你好" is a char string and cannot be passed where an LPCTSTR is expected, so always use the _T macro; it also shows that you are writing with Unicode in mind.

4. Fix string size calculations

Some string functions need the number of characters in a buffer (sizeof(szBuffer)/sizeof(TCHAR)), while others need the number of bytes (sizeof(szBuffer)). Be aware of the difference and check each string function call carefully to make sure it gets the right value.
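The two calculations can be put side by side in a short sketch. It uses a wchar_t buffer as a stand-in for a TCHAR buffer in a Unicode build, and the names g_buffer, buffer_bytes, buffer_chars, and store are invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Stand-in for a TCHAR buffer in a Unicode build.
wchar_t g_buffer[32];

// Size in bytes: what memory functions such as memcpy expect.
inline std::size_t buffer_bytes() { return sizeof(g_buffer); }

// Size in characters: what string functions that take a count expect.
inline std::size_t buffer_chars() { return sizeof(g_buffer) / sizeof(g_buffer[0]); }

// Copies text into the buffer, passing the capacity in characters.
// Passing buffer_bytes() here instead would overstate the capacity.
inline const wchar_t *store(const wchar_t *text) {
    std::wcsncpy(g_buffer, text, buffer_chars() - 1);
    g_buffer[buffer_chars() - 1] = L'\0';
    return g_buffer;
}
```

In an ANSI build the two sizes happen to be equal, which is why this kind of mistake tends to surface only after switching to Unicode.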

The ANSI functions start with str, such as strcpy(), strcat(), strlen();

The Unicode functions start with wcs, such as wcscpy(), wcscat(), wcslen();

The generic ANSI/Unicode functions of the C run-time library start with _tcs, such as _tcscpy();

The generic ANSI/Unicode Windows functions start with lstr, such as lstrcpy();

For compatibility with both ANSI and Unicode, we should use the generic string functions that start with _tcs or lstr.

VI. A Unicode programming example

Step One:

Open VC++ 6.0 and create a new dialog-based project named Unicode. Add a button control to the main dialog box IDD_UNICODE_DIALOG, double-click the control, and add its handler function:

void CUnicodeDlg::OnButton1()
{
    // _T() makes the literal narrow or wide to match the build
    const TCHAR* str1 = _T("ANSI和UNICODE编码试验");
    m_disp = str1;
    // push the value of m_disp to the static text control
    UpdateData(FALSE);
}

Add a static text control IDC_DISP, and use ClassWizard to add a CString variable m_disp for the control. Compile the project in the default ANSI environment to generate Unicode.exe.

Step Two:

Open the Control Panel and click "Date, Time, Language, and Regional Options". In that window, click "Regional and Language Options" to open the Regional and Language Options dialog box. On the Advanced tab, change the language for non-Unicode programs to Japanese and click Apply, as shown in Figure 4:

Figure 4

In the dialog box that pops up, click Yes and restart the computer for the setting to take effect.

Run Unicode.exe and click the "Button1" button. The static text box shows garbled text.

Step Three:

Recompile the project in the Unicode environment to generate Unicode.exe. Run Unicode.exe again and click the "Button1" button. Now you can see the benefit of Unicode.

The functions related to character handling are as follows:

Character classification:
Wide character function	General C function	Description
iswalnum()	isalnum()	Tests whether the character is a letter or a digit
iswalpha()	isalpha()	Tests whether the character is a letter
iswcntrl()	iscntrl()	Tests whether the character is a control character
iswdigit()	isdigit()	Tests whether the character is a digit
iswgraph()	isgraph()	Tests whether the character is a visible character
iswlower()	islower()	Tests whether the character is lowercase
iswprint()	isprint()	Tests whether the character is printable
iswpunct()	ispunct()	Tests whether the character is a punctuation mark
iswspace()	isspace()	Tests whether the character is white space
iswupper()	isupper()	Tests whether the character is uppercase
iswxdigit()	isxdigit()	Tests whether the character is a hexadecimal digit

Case conversion:
Wide character function	General C function	Description
towlower()	tolower()	Converts a character to lowercase
towupper()	toupper()	Converts a character to uppercase
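A brief sketch of the classification and case-conversion functions working together, in portable C++. The function normalize_key is a hypothetical helper written for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <string>

// Upper-cases the letters of a wide string and drops everything that is
// not a letter or a digit. Results for non-ASCII characters depend on the
// current locale; the ASCII range behaves the same everywhere.
inline std::wstring normalize_key(const std::wstring &text) {
    std::wstring out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const wchar_t ch = text[i];
        if (std::iswalpha(ch)) {
            out += static_cast<wchar_t>(std::towupper(ch));
        } else if (std::iswdigit(ch)) {
            out += ch;
        }
    }
    return out;
}
```

The narrow-character version would look the same with isalpha(), isdigit(), and toupper() on a std::string.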

Character comparison:
Wide character function	General C function	Description
wcscoll()	strcoll()	Compares strings (using the locale's collation order)

Date and time conversion:
Function	Description
strftime()	Formats the date and time according to the specified format string and locale
wcsftime()	Formats the date and time according to the specified format string and locale, and returns a wide string
strptime()	Converts a string to a time value according to the specified format; the inverse of strftime()

Print and scan strings:
General C function / wide character function	Description
fprintf()/fwprintf()	Formatted output using a variable argument list
fscanf()/fwscanf()	Formatted input
printf()	Formatted output to standard output using a variable argument list
scanf()	Formatted input from standard input
sprintf()/swprintf()	Formats into a string using a variable argument list
sscanf()	Formatted input from a string
vfprintf()/vfwprintf()	Formatted output to a file using a stdarg argument list
vprintf()	Formatted output to standard output using a stdarg argument list
vsprintf()/vswprintf()	Formats a stdarg argument list and writes it to a string
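As an illustration of the wide formatting functions, the sketch below uses the standard C form of swprintf. The helper name format_setting is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Formats "name=value" into a wide string with the standard swprintf,
// which takes the buffer capacity in characters as its second argument.
// (Visual C++ 6.0 shipped an older swprintf without that argument.)
inline std::wstring format_setting(const wchar_t *name, int value) {
    wchar_t buffer[64];
    const std::size_t capacity = sizeof(buffer) / sizeof(buffer[0]);
    const int written = std::swprintf(buffer, capacity, L"%ls=%d", name, value);
    if (written < 0) {
        return std::wstring();   // did not fit or could not be formatted
    }
    return std::wstring(buffer, static_cast<std::size_t>(written));
}
```

Note that the capacity is passed in characters, not bytes, which ties in with the size calculations discussed earlier.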

Numeric conversion:
Wide character function	General C function	Description
wcstod()	strtod()	Converts the initial part of a wide string to a double-precision floating-point number
wcstol()	strtol()	Converts the initial part of a wide string to a long integer
wcstoul()	strtoul()	Converts the initial part of a wide string to an unsigned long integer
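The phrase "initial part" is easiest to see in use. The following sketch wraps wcstol; parse_leading_long is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Parses the leading integer of a wide string with wcstol and reports
// how many characters were consumed, mirroring how strtol is used.
inline long parse_leading_long(const wchar_t *text, int base, std::size_t *consumed) {
    wchar_t *end = 0;
    const long value = std::wcstol(text, &end, base);
    if (consumed) {
        *consumed = static_cast<std::size_t>(end - text);
    }
    return value;
}
```

Conversion stops at the first character that cannot be part of a number in the given base, and the rest of the string is left for the caller.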

Multibyte and wide character conversion and handling:
Function	Description
mblen()	Determines the number of bytes in a character according to the locale
mbstowcs()	Converts a multibyte string to a wide string
mbtowc()/btowc()	Converts a multibyte character to a wide character
wcstombs()	Converts a wide string to a multibyte string
wctomb()/wctob()	Converts a wide character to a multibyte character
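A minimal sketch of a round trip with mbstowcs and wcstombs follows. The wrappers to_wide and to_narrow are hypothetical names, and the result for anything outside ASCII depends on the locale selected with setlocale, which this sketch leaves at its default.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Converts a multibyte string to a wide string with mbstowcs, using the
// current C locale. Returns an empty string if the input is not valid in
// that locale.
inline std::wstring to_wide(const std::string &text) {
    // A multibyte string never yields more wide characters than bytes.
    std::vector<wchar_t> buffer(text.size() + 1);
    const std::size_t count = std::mbstowcs(&buffer[0], text.c_str(), buffer.size());
    if (count == static_cast<std::size_t>(-1)) {
        return std::wstring();
    }
    return std::wstring(&buffer[0], count);
}

// The reverse direction with wcstombs.
inline std::string to_narrow(const std::wstring &text) {
    // MB_CUR_MAX is the most bytes one character can need in this locale.
    std::vector<char> buffer(text.size() * MB_CUR_MAX + 1);
    const std::size_t count = std::wcstombs(&buffer[0], text.c_str(), buffer.size());
    if (count == static_cast<std::size_t>(-1)) {
        return std::string();
    }
    return std::string(&buffer[0], count);
}
```

Both functions return the number of characters written, not counting the terminator, or (size_t)-1 when a character cannot be converted.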

Input and output:
Wide character function	General C function	Description
fgetwc()	fgetc()	Reads a character from a stream and converts it to a wide character
fgetws()	fgets()	Reads a string from a stream and converts it to a wide string
fputwc()	fputc()	Converts a wide character to a multibyte character and writes it to a stream
fputws()	fputs()	Converts a wide string to multibyte characters and writes them to a stream
getwc()	getc()	Reads a character from a stream and converts it to a wide character
getwchar()	getchar()	Reads a character from standard input and converts it to a wide character
None	gets()	Use fgetws() instead
putwc()	putc()	Converts a wide character to a multibyte character and writes it to a stream
putwchar()	putchar()	Converts a wide character to a multibyte character and writes it to standard output
None	puts()	Use fputws() instead
ungetwc()	ungetc()	Pushes a wide character back onto the input stream

String manipulation:
Wide character function	General C function	Description
wcscat()	strcat()	Appends one string to the end of another
wcsncat()	strncat()	Like wcscat(), but with a limit on the number of characters appended
wcschr()	strchr()	Finds the first occurrence of a character
wcsrchr()	strrchr()	Finds the last occurrence of a character (searching from the end)
wcspbrk()	strpbrk()	Finds the first occurrence in a string of any character from another string
wcswcs()/wcsstr()	strstr()	Finds the first occurrence of one string within another
wcscspn()	strcspn()	Returns the length of the initial segment containing no characters from the second string
wcsspn()	strspn()	Returns the length of the initial segment consisting only of characters from the second string
wcscpy()	strcpy()	Copies a string
wcsncpy()	strncpy()	Like wcscpy(), but with a limit on the number of characters copied
wcscmp()	strcmp()	Compares two wide strings
wcsncmp()	strncmp()	Like wcscmp(), but with a limit on the number of characters compared
wcslen()	strlen()	Returns the number of characters in a wide string
wcstok()	strtok()	Splits a wide string into a series of tokens based on delimiters
wcswidth()	None	Returns the display width of a wide string
wcwidth()	None	Returns the display width of a wide character

There are also wmemcpy(), wmemchr(), wmemcmp(), wmemmove(), and wmemset(), which correspond to the memory functions.
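A few of the wide string manipulation functions can be combined as in the sketch below. The helpers join_path and file_part are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Builds "dir\file" in a caller-supplied buffer with wcscpy and wcscat.
// capacity is in characters; returns false if the result would not fit.
inline bool join_path(wchar_t *out, std::size_t capacity,
                      const wchar_t *dir, const wchar_t *file) {
    if (std::wcslen(dir) + 1 + std::wcslen(file) + 1 > capacity) {
        return false;
    }
    std::wcscpy(out, dir);
    std::wcscat(out, L"\\");
    std::wcscat(out, file);
    return true;
}

// Returns the part after the last backslash, found with wcsrchr.
inline const wchar_t *file_part(const wchar_t *path) {
    const wchar_t *slash = std::wcsrchr(path, L'\\');
    return slash ? slash + 1 : path;
}
```

The length check comes first because wcscpy and wcscat, like strcpy and strcat, do not check the destination size themselves.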

http://blog.csdn.net/witch_soya/article/details/6851590


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
