Unicode programming notes (repost)
1. How do I get the number of characters in a string that contains both single-byte and double-byte characters?
Call the _mbslen function from the Microsoft Visual C++ runtime library. It operates on multibyte strings, that is, strings that can mix single-byte and double-byte characters.
Calling strlen does not tell you how many characters are in the string. It only tells you how many bytes precede the terminating 0.
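The difference between bytes and characters can be sketched in portable C. This is only an illustration of the idea behind _mbslen, not its implementation: is_lead_byte_cp932 and dbcs_char_count are hypothetical names, and the lead-byte ranges are hard-coded on the assumption that the text is Shift-JIS (code page 932).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for IsDBCSLeadByte, assuming code page 932 (Shift-JIS). */
static int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts characters rather than bytes in a DBCS string. */
static size_t dbcs_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        /* A lead byte followed by a non-NUL trail byte forms one character. */
        if (is_lead_byte_cp932((unsigned char)s[0]) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        ++count;
    }
    return count;
}
```

For the four-byte string "A\x82\xA0" "B" (an ASCII letter, one double-byte character, another ASCII letter), strlen reports 4 while dbcs_char_count reports 3.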
2. How do I operate on DBCS strings?
Function                           Description
PTSTR CharNext(LPCTSTR);           Returns the address of the next character in the string.
PTSTR CharPrev(LPCTSTR, LPCTSTR);  Returns the address of the previous character in the string.
BOOL IsDBCSLeadByte(BYTE);         Returns a nonzero value if the byte is the first byte of a DBCS character.
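To show why CharPrev needs the start of the string as well as the current position, here is a rough portable sketch. The names dbcs_next, dbcs_prev and is_lead_byte_cp932 are hypothetical, and the lead-byte test again assumes code page 932. Because a trail byte can have the same value as a lead byte or an ASCII character, stepping backwards safely means rescanning forward from the start.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByte, assuming code page 932 (Shift-JIS). */
static int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Rough analogue of CharNext: steps over one character (1 or 2 bytes). */
static const char *dbcs_next(const char *p)
{
    assert(p != NULL);
    if (*p == '\0')
        return p;                       /* stay on the terminator */
    if (is_lead_byte_cp932((unsigned char)p[0]) && p[1] != '\0')
        return p + 2;
    return p + 1;
}

/* Rough analogue of CharPrev: walks forward from start and remembers
   the last character boundary seen before current. */
static const char *dbcs_prev(const char *start, const char *current)
{
    const char *prev = start;
    const char *p = start;

    assert(start != NULL && current != NULL && start <= current);
    while (p < current && *p != '\0') {
        prev = p;
        p = dbcs_next(p);
    }
    return prev;
}
```

When current equals start, dbcs_prev returns start, which mirrors how CharPrev behaves at the beginning of a string.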
3. Why Unicode?
(1) It is easy to exchange data between different languages.
(2) It lets you distribute a single .exe or DLL file that supports all languages.
(3) It improves the efficiency of your application.
Windows 2000 was developed from scratch using Unicode. If you call a Windows function and pass it an ANSI string, the system must first convert the string to Unicode and then pass the Unicode string to the operating system. If the function is expected to return an ANSI string, the system converts the Unicode string to ANSI before returning the result to your application. These conversions cost time and memory. By developing with Unicode from the start, you make your application run more efficiently.
Windows CE is itself a Unicode operating system and does not support the ANSI Windows functions at all.
Windows 98 supports only ANSI, so applications for it must be developed for ANSI.
When Microsoft ported COM from 16-bit Windows to Win32, the company decided that every COM interface method that takes a string would accept only Unicode strings.
4. How do I compile Unicode source code?
Microsoft designed the Windows API for Unicode so as to minimize the impact on your code. In fact, you can write a single source file that compiles with or without Unicode. You only need to define two macros (UNICODE and _UNICODE) and recompile the source file.
The _UNICODE macro is used by the C runtime header files, while the UNICODE macro is used by the Windows header files. When compiling a source module, define both macros together.
5. What Unicode data types does Windows define?
Data type   Description
WCHAR       Unicode character
PWSTR       Pointer to a Unicode string
PCWSTR      Pointer to a constant Unicode string
The corresponding ANSI data types are CHAR, LPSTR, and LPCSTR.
The generic ANSI/Unicode data types are TCHAR, PTSTR, and LPCTSTR.
6. How do I operate on Unicode strings?
Character set   Function prefix   Example
ANSI            str               strcpy
Unicode         wcs               wcscpy
MBCS            _mbs              _mbscpy
ANSI/Unicode    _tcs              _tcscpy (C runtime library)
ANSI/Unicode    lstr              lstrcpy (Windows function)
In Windows 2000, all new and non-obsolete functions have both an ANSI and a Unicode version. The names of ANSI versions end with A, and the names of Unicode versions end with W. The Windows headers define the generic name as follows:
#ifdef UNICODE
#define CreateWindowEx CreateWindowExW
#else
#define CreateWindowEx CreateWindowExA
#endif // !UNICODE
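The A/W pattern above can be imitated in portable C. This is only a sketch under stated assumptions: TitleLengthA, TitleLengthW and MY_UNICODE are hypothetical names (MY_UNICODE plays the role of UNICODE), and the ANSI version uses the standard mbstowcs with a fixed 64-character buffer, where the real system allocates memory as needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical Unicode version: the one that does the real work. */
static size_t TitleLengthW(const wchar_t *title)
{
    assert(title != NULL);
    return wcslen(title);
}

/* Hypothetical ANSI version: widens its argument, then forwards to the W version.
   Input longer than 63 characters is silently truncated in this sketch. */
static size_t TitleLengthA(const char *title)
{
    wchar_t wide[64];
    size_t n;

    assert(title != NULL);
    n = mbstowcs(wide, title, 63);   /* converts at most 63 characters */
    if (n == (size_t)-1)
        return 0;                    /* invalid multibyte sequence */
    wide[n] = L'\0';                 /* n <= 63, so this index is in range */
    return TitleLengthW(wide);
}

/* The generic name resolves to one of the two versions at compile time. */
#ifdef MY_UNICODE
#define TitleLength TitleLengthW
#else
#define TitleLength TitleLengthA
#endif
```

Callers write TitleLength everywhere. The extra conversion step inside TitleLengthA is the cost that an ANSI build pays on every call.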
7. How do I represent Unicode string constants?
Character set   Example
ANSI            "string"
Unicode         L"string"
ANSI/Unicode    _T("string") or _TEXT("string")
The same macros work for character constants, for example: if (szError[0] == _TEXT('J')) { }
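How a generic-text macro and type might be built can be sketched as follows. None of these names come from tchar.h: my_tchar, MY_T, my_tcslen, my_tcscpy, copy_text and MY_UNICODE are hypothetical stand-ins for TCHAR, _T, _tcslen, _tcscpy and _UNICODE, chosen so the example compiles with any C compiler.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T_PASTE(x) L##x      /* paste the L prefix onto the literal */
#define my_tcslen wcslen
#define my_tcscpy wcscpy
#else
typedef char my_tchar;
#define MY_T_PASTE(x) x         /* leave the literal unchanged */
#define my_tcslen strlen
#define my_tcscpy strcpy
#endif

/* Second macro level, so that an argument which is itself a macro
   is expanded before the L prefix is pasted on. */
#define MY_T(x) MY_T_PASTE(x)

/* Copies src into a buffer that holds `capacity` characters (not bytes).
   Returns 1 on success, 0 if src does not fit. */
static int copy_text(my_tchar *dst, size_t capacity, const my_tchar *src)
{
    assert(dst != NULL && src != NULL);
    if (my_tcslen(src) + 1 > capacity)
        return 0;
    my_tcscpy(dst, src);
    return 1;
}
```

The same call, copy_text(buf, 16, MY_T("string")), compiles to the narrow or the wide variant depending only on whether MY_UNICODE is defined.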
8. Why should I try to use operating system functions?
Doing so helps performance. Because these functions are used so heavily, they have probably already been loaded into RAM by the time your application runs.
Examples are the shell string functions StrCat, StrChr, StrCmp, and StrCpy.
9. How do I write applications that compile for both ANSI and Unicode?
(1) Treat text strings as arrays of characters, not as arrays of char or arrays of bytes.
(2) Use generic data types (such as TCHAR and PTSTR) for text characters and strings.
(3) Use explicit data types (such as BYTE and PBYTE) for bytes, byte pointers, and data buffers.
(4) Use the TEXT macro for literal characters and strings.
(5) Perform global replacements (for example, replace PSTR with PTSTR).
(6) Modify string arithmetic. For example, functions usually expect a buffer size in characters, not in bytes. That means you should pass sizeof(szBuffer) / sizeof(TCHAR) rather than sizeof(szBuffer). Also, if you need to allocate a memory block for a string and you know the number of characters, remember that memory is allocated in bytes. That is, call malloc(nCharacters * sizeof(TCHAR)) rather than malloc(nCharacters).
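Point (6) can be made concrete with a small sketch. It assumes a Unicode build and uses the hypothetical typedef my_tchar in place of TCHAR; TEXT_CAPACITY, alloc_chars and dup_text are likewise illustrative names, not Windows or runtime library functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

typedef wchar_t my_tchar;   /* hypothetical stand-in for TCHAR in a Unicode build */

/* Capacity of a fixed-size array in characters: what character-count APIs expect. */
#define TEXT_CAPACITY(buf) (sizeof(buf) / sizeof((buf)[0]))

/* Allocates room for n_chars characters plus the terminator.
   malloc takes bytes, so the character count is scaled by the character size. */
static my_tchar *alloc_chars(size_t n_chars)
{
    return malloc((n_chars + 1) * sizeof(my_tchar));
}

/* Duplicates a string; the caller frees the result. Returns NULL on failure. */
static my_tchar *dup_text(const my_tchar *src)
{
    size_t n;
    my_tchar *copy;

    assert(src != NULL);
    n = wcslen(src);
    copy = alloc_chars(n);
    if (copy != NULL)
        wmemcpy(copy, src, n + 1);   /* wmemcpy counts characters, not bytes */
    return copy;
}
```

For a buffer declared as my_tchar szBuffer[32], TEXT_CAPACITY(szBuffer) is 32, while sizeof(szBuffer) is 32 * sizeof(my_tchar) and would overstate the capacity if passed to a function that expects characters.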
10. How do I compare strings with comparison options?
Call CompareString and pass it the flags you need.
Flag                  Meaning
NORM_IGNORECASE       Ignore case differences.
NORM_IGNOREKANATYPE   Do not distinguish hiragana from katakana.
NORM_IGNORENONSPACE   Ignore nonspacing characters.
NORM_IGNORESYMBOLS    Ignore symbols.
NORM_IGNOREWIDTH      Do not distinguish a single-byte character from the same character as a double-byte character.
SORT_STRINGSORT       Treat punctuation the same as symbols.
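The idea of a flag-controlled comparison can be sketched in portable C. This is not CompareString: compare_text and OPT_IGNORE_CASE are hypothetical, the comparison is a plain code-unit comparison rather than a linguistic one, and only case-insensitivity is modelled. The result codes 1, 2 and 3 are chosen to echo CompareString's CSTR_LESS_THAN, CSTR_EQUAL and CSTR_GREATER_THAN.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Result codes, mirroring the 1/2/3 convention of CompareString. */
enum { CMP_LESS = 1, CMP_EQUAL = 2, CMP_GREATER = 3 };

/* Hypothetical option flag, playing the role of NORM_IGNORECASE. */
#define OPT_IGNORE_CASE 0x1u

/* Compares two wide strings character by character, optionally ignoring case. */
static int compare_text(unsigned flags, const wchar_t *a, const wchar_t *b)
{
    assert(a != NULL && b != NULL);
    for (;; ++a, ++b) {
        wint_t ca = (wint_t)*a;
        wint_t cb = (wint_t)*b;

        if (flags & OPT_IGNORE_CASE) {
            ca = towlower(ca);
            cb = towlower(cb);
        }
        if (ca != cb)
            return ca < cb ? CMP_LESS : CMP_GREATER;
        if (ca == 0)
            return CMP_EQUAL;        /* both strings ended together */
    }
}
```

With OPT_IGNORE_CASE set, L"Hello" and L"hELLO" compare equal; without it they do not.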
11. How can I determine whether a text file is ANSI or Unicode?
Check the first two bytes of the file. If they are 0xFF and 0xFE (the little-endian UTF-16 byte-order mark), the file is Unicode; otherwise, treat it as ANSI.
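The check described above fits in a few lines. This is a minimal sketch: text_kind and sniff_text_kind are hypothetical names, and, as in the rule itself, anything without the 0xFF 0xFE prefix is simply assumed to be ANSI.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TEXT_ANSI, TEXT_UTF16LE } text_kind;

/* Classifies a buffer holding the first bytes of a text file. */
static text_kind sniff_text_kind(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return TEXT_UTF16LE;         /* byte-order mark found */
    return TEXT_ANSI;                /* no mark: assume ANSI */
}
```

A caller would read the first two bytes of the file into a buffer and pass them in. A Unicode file saved without the mark is misclassified as ANSI by this test.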
12. How can I determine whether a string is ANSI or Unicode?
Use IsTextUnicode. IsTextUnicode applies a series of statistical and deterministic tests to guess what the buffer contains. Because this is not an exact science, IsTextUnicode can return an incorrect result.
13. How do I convert a string between Unicode and ANSI?
The Windows function MultiByteToWideChar converts a multibyte string to a wide-character string. The function WideCharToMultiByte converts a wide-character string to the equivalent multibyte string.
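A common way to use such conversion functions is to call them twice: once to measure the result and once to convert into a buffer of that size. The sketch below shows that pattern with the standard C functions mbsrtowcs and wcsrtombs instead of the Windows functions, so it follows the current C locale rather than a code page argument. The names widen and narrow are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Multibyte -> wide. The caller frees the result; NULL means failure. */
static wchar_t *widen(const char *src)
{
    mbstate_t st;
    const char *p = src;
    size_t n;
    wchar_t *dst;

    assert(src != NULL);
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);         /* first call: measure only */
    if (n == (size_t)-1)
        return NULL;
    dst = malloc((n + 1) * sizeof *dst);     /* size is in bytes */
    if (dst == NULL)
        return NULL;
    p = src;
    memset(&st, 0, sizeof st);
    mbsrtowcs(dst, &p, n + 1, &st);          /* second call: convert */
    return dst;
}

/* Wide -> multibyte. The caller frees the result; NULL means failure. */
static char *narrow(const wchar_t *src)
{
    mbstate_t st;
    const wchar_t *p = src;
    size_t n;
    char *dst;

    assert(src != NULL);
    memset(&st, 0, sizeof st);
    n = wcsrtombs(NULL, &p, 0, &st);         /* first call: measure only */
    if (n == (size_t)-1)
        return NULL;
    dst = malloc(n + 1);
    if (dst == NULL)
        return NULL;
    p = src;
    memset(&st, 0, sizeof st);
    wcsrtombs(dst, &p, n + 1, &st);          /* second call: convert */
    return dst;
}
```

In the default "C" locale only ASCII text is guaranteed to convert; a program would normally call setlocale first to select the encoding it expects.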
--------------------------------------------------------------------------------
Visual C++ Concepts: Adding Functionality
(from MSDN)
Unicode Programming Summary
To take advantage of the MFC and C runtime support for Unicode, you need to:
Define _UNICODE.
Define the _UNICODE symbol before you build your program.
Specify the entry point.
On the Output page of the Linker folder in the project's Property Pages dialog box, set the Entry Point symbol to wWinMainCRTStartup.
Use portable runtime functions and types.
Use the proper C runtime functions for Unicode string handling. You can use the wcs family of functions, but you may prefer the fully portable (internationally enabled) _TCHAR macros. These macros are all prefixed with _tcs; they substitute, one for one, for the str family of functions. These functions are described in detail in the Internationalization section of the Run-Time Library Reference. For more information, see Generic-Text Mappings in Tchar.h.
Use _TCHAR and the related portable data types described in Support for Unicode.
Handle literal strings properly.
The Visual C++ compiler interprets a literal string coded as
L"this is a literal string" to mean a string of Unicode characters. You can use the same prefix for literal characters. Use the _T macro to code literal strings generically, so that they compile as Unicode strings under Unicode and as ANSI strings (including MBCS) without Unicode. For example, instead of:
pWnd->SetWindowText("Hello"); use:
pWnd->SetWindowText(_T("Hello")); With _UNICODE defined, _T translates the literal string to the L-prefixed form; otherwise, _T translates the string without the L prefix.
Tip: The _T macro is identical to the _TEXT macro.
Be careful when passing string lengths to functions.
Some functions want the number of characters in a string; others want the number of bytes. For example, if _UNICODE is defined, the following call to a CArchive object will not work (str is a CString):
archive.Write(str, str.GetLength()); // invalid. In a Unicode application, each character is two bytes wide, so the length gives the number of characters but not the correct number of bytes. Instead, you must use:
archive.Write(str, str.GetLength() * sizeof(_TCHAR)); // valid. This specifies the correct number of bytes to write.
However, MFC member functions that are character-oriented rather than byte-oriented work without this extra coding:
pDC->TextOut(str, str.GetLength()); CDC::TextOut takes a number of characters, not a number of bytes.
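The same pitfall can be reproduced without MFC. In this sketch, byte_sink, sink_write and sink_write_text are hypothetical: byte_sink stands in for a byte-oriented writer such as a CArchive, and wchar_t stands in for _TCHAR in a Unicode build. Note that sizeof(wchar_t) is 2 with Visual C++ but 4 on many other platforms, which is one more reason to write sizeof rather than a literal 2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical byte-oriented sink with a fixed capacity. */
typedef struct {
    unsigned char data[256];
    size_t used;
} byte_sink;

/* Appends n_bytes raw bytes. Returns 1 on success, 0 if they do not fit. */
static int sink_write(byte_sink *sink, const void *src, size_t n_bytes)
{
    assert(sink != NULL && src != NULL);
    if (n_bytes > sizeof sink->data - sink->used)
        return 0;
    memcpy(sink->data + sink->used, src, n_bytes);
    sink->used += n_bytes;
    return 1;
}

/* Writes the characters of text (without the terminator) to the sink. */
static int sink_write_text(byte_sink *sink, const wchar_t *text)
{
    size_t n_chars;

    assert(text != NULL);
    n_chars = wcslen(text);
    /* Scale to bytes; passing n_chars alone would cut the text short. */
    return sink_write(sink, text, n_chars * sizeof(wchar_t));
}
```

After writing L"Hello", the sink holds 5 * sizeof(wchar_t) bytes, not 5.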
To summarize, MFC and the runtime library provide the following support for Unicode programming under Windows 2000:
Except for database class member functions, all MFC functions are Unicode-enabled, including CString. CString also provides Unicode/ANSI conversion functions.
The runtime library supplies Unicode versions of all string-handling functions. (The runtime library also supplies portable versions suitable for Unicode or for MBCS. These are the _tcs macros.)
Tchar.h supplies portable data types and the _T macro for translating literal strings and characters. See Generic-Text Mappings in Tchar.h.
The runtime library provides a wide-character version of main. Use wmain to make your application Unicode-aware.
--------------------------------------------------------------------------------
Unicode programming in VC
Windows programming supports Unicode, and Unicode is the general trend: the core of Windows 2000 is built on Unicode. Even when you call an ANSI API (one ending in A, such as SetWindowTextA), the system dynamically allocates a block of memory on your process's default heap, stores the converted Unicode string there, and passes the converted string to the API. If you call an API that returns an ANSI string, Windows performs the reverse conversion behind the scenes, which costs time. Even if efficiency does not matter to you, don't you want your software to be internationalized? Do you still want to face the embarrassing problem of a Chinese character being cut in half?
In fact, Unicode programming in VC is not much trouble. It goes roughly as follows:
1. Add the UNICODE and _UNICODE preprocessor definitions to the project. In VC.NET, add both macros under Project -> Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions. (In VC6, use Project -> Settings -> C/C++ -> Preprocessor definitions in the General category.)
2. Include <tchar.h> (usually in stdafx.h), then change every variable declared as char * to LPTSTR/TCHAR *, and every const char * to LPCTSTR/const TCHAR *.
3. Wrap all string constants in the _T() macro, for example TCHAR *szText = _T("My text");
4. Replace all C library string functions with their generic counterparts, for example:
strlen -> _tcslen
strcat -> _tcscat
strcmp -> _tcscmp
......
Note that every "text length" these functions take or return is counted in characters (TCHAR units), not in bytes. See MSDN for details.
5. API calls generally need no special handling. Once UNICODE and _UNICODE are defined, macros direct every API to the version ending in W (if they are not defined, to the version ending in A).
None of this forces you to use Unicode. If you still want to use ANSI, just remove the two macros defined in step 1 and carry on with ANSI programming.