Conversion of Unicode and ANSI strings

Source: Internet
Author: User

Continuing from the previous article, "Multi-character set (ANSI) and Unicode and string processing method guidelines", we now turn to some special requirements:

Sometimes our string is of the multibyte type but we need the wide-character type; sometimes the opposite is true.

Windows provides the following functions to solve this problem:

1. MultiByteToWideChar

Function:
This function maps a character string to a wide-character (Unicode) string. The string mapped by this function does not have to come from a multibyte character set.

Function prototype:
int MultiByteToWideChar(UINT CodePage, DWORD dwFlags, LPCSTR lpMultiByteStr, int cchMultiByte, LPWSTR lpWideCharStr, int cchWideChar);


Parameters:

CodePage: Specifies the code page to use for the conversion. This parameter can be set to any code page that is installed or available on the system. You can also specify one of the following values:

CP_ACP: ANSI code page; CP_MACCP: Macintosh code page; CP_OEMCP: OEM code page;

CP_SYMBOL: symbol code page (42); CP_THREAD_ACP: ANSI code page of the current thread;

CP_UTF7: convert using UTF-7; CP_UTF8: convert using UTF-8.

dwFlags: A set of bit flags that indicate whether to translate to precomposed or composite wide characters (if a composite form exists), whether to use glyph characters in place of control characters, and how to handle invalid characters. You can specify a combination of the following flag constants:

MB_PRECOMPOSED: Always use precomposed characters, that is, use a single character value for a combination of a base character and a nonspacing character. This is the default translation option. Cannot be used with MB_COMPOSITE.

MB_COMPOSITE: Always use composite characters, that is, characters in which a base character and a nonspacing character have different character values. Cannot be used with MB_PRECOMPOSED.

MB_ERR_INVALID_CHARS: If the function encounters an invalid input character, it fails and GetLastError returns ERROR_NO_UNICODE_TRANSLATION.

MB_USEGLYPHCHARS: Use glyph characters instead of control characters.

A composite character consists of a base character and a nonspacing character, each having a different character value. A precomposed character has a single character value for a base/nonspacing character combination. In the character è, the e is the base character and the grave accent mark is the nonspacing character.

The function's default behavior is to translate to the precomposed form. If a precomposed form does not exist, the function attempts to translate to a composite form.

The flags MB_PRECOMPOSED and MB_COMPOSITE are mutually exclusive. The flags MB_USEGLYPHCHARS and MB_ERR_INVALID_CHARS can be set regardless of the state of the other flags.
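To make the precomposed/composite distinction concrete, here is a small sketch in portable C that spells out both forms of è as wide-character arrays. It does not call MultiByteToWideChar; the array names and the helper has_combining_mark are hypothetical, and the helper only checks the Combining Diacritical Marks block (U+0300 to U+036F), which is an illustrative simplification rather than a complete test for nonspacing characters.

```c
#include <assert.h>
#include <wchar.h>

/* U+00E8: the precomposed form, one character value for "e with grave". */
static const wchar_t kPrecomposed[] = { 0x00E8, 0 };

/* U+0065 (base character "e") followed by U+0300 (combining grave accent):
   the composite form, two character values for the same visible character. */
static const wchar_t kComposite[] = { 0x0065, 0x0300, 0 };

/* Hypothetical helper: returns 1 if the string contains a character from the
   Combining Diacritical Marks block (U+0300..U+036F), otherwise 0. */
static int has_combining_mark(const wchar_t *s)
{
    assert(s != NULL);
    for (; *s != 0; ++s) {
        if (*s >= 0x0300 && *s <= 0x036F)
            return 1;
    }
    return 0;
}
```

Both arrays render as the same character, but they differ in length, which is why buffer sizes should always come from the conversion function rather than from the source string length.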

lpMultiByteStr: Pointer to the string to be converted.

cchMultiByte: Specifies the size, in bytes, of the string pointed to by lpMultiByteStr. If the value is -1, the string is assumed to be null-terminated and its length is calculated automatically.

lpWideCharStr: Pointer to the buffer that receives the converted string.

cchWideChar: Specifies the size, in wide characters (not bytes), of the buffer pointed to by lpWideCharStr. If the value is zero, the function returns the required buffer size in wide characters and does not use the lpWideCharStr buffer.


Return Value

If the function succeeds and cchWideChar is nonzero, the return value is the number of wide characters written to the buffer pointed to by lpWideCharStr. If the function succeeds and cchWideChar is zero, the return value is the required size, in wide characters, of the buffer that is to receive the converted string. If the function fails, the return value is zero. To get extended error information, call GetLastError, which can return the following error codes:

ERROR_INSUFFICIENT_BUFFER; ERROR_INVALID_FLAGS;

ERROR_INVALID_PARAMETER; ERROR_NO_UNICODE_TRANSLATION.


Note:

The pointers lpMultiByteStr and lpWideCharStr must be different. If they are the same, the function fails and GetLastError returns ERROR_INVALID_PARAMETER.

If MB_ERR_INVALID_CHARS is set and the function encounters an invalid character in the source string, the function fails. An invalid character is either (a) a character that is not the default character in the source string but is translated to the default character when MB_ERR_INVALID_CHARS is not set, or (b) in a DBCS string, a lead byte without a valid trail byte. When an invalid character is found and MB_ERR_INVALID_CHARS is set, the function returns zero and GetLastError returns ERROR_NO_UNICODE_TRANSLATION.

2. WideCharToMultiByte

Function:
This function maps a Unicode (wide-character) string to a multibyte string.

Function prototype:
int WideCharToMultiByte(UINT CodePage, DWORD dwFlags, LPCWSTR lpWideCharStr, int cchWideChar, LPSTR lpMultiByteStr, int cchMultiByte, LPCSTR lpDefaultChar, PBOOL pfUsedDefaultChar);

Parameters:

CodePage: Specifies the code page to use for the conversion. This parameter can be set to any code page that is installed or available on the system. You can also specify one of the following values:

CP_ACP: ANSI code page; CP_MACCP: Macintosh code page; CP_OEMCP: OEM code page;

CP_SYMBOL: symbol code page (42); CP_THREAD_ACP: ANSI code page of the current thread;

CP_UTF7: convert using UTF-7; CP_UTF8: convert using UTF-8.

dwFlags: [in] Specifies the handling of unmapped characters. The function performs more quickly when none of these flags is set. The following flag constants are defined:

WC_NO_BEST_FIT_CHARS: (Windows 98/Me and Windows 2000/XP) Any Unicode characters that do not translate directly to multibyte equivalents are translated to the default character (see the lpDefaultChar parameter). In other words, if translating from Unicode to multibyte and back to Unicode again does not yield the exact same Unicode character, the default character is used. This flag can be used by itself or in combination with the other dwFlags options.

WC_COMPOSITECHECK: Convert composite characters to precomposed characters.

WC_DISCARDNS: Discard nonspacing characters during conversion.

WC_SEPCHARS: Generate separate characters during conversion. This is the default conversion behavior.

WC_DEFAULTCHAR: Replace exceptions with the default character during conversion.

When WC_COMPOSITECHECK is specified, the function converts composite characters to precomposed characters. A composite character consists of a base character and a nonspacing character, each having different character values. A precomposed character has a single character value for a base/nonspacing character combination. In the character è, the e is the base character, and the grave accent mark is the nonspacing character.

When an application specifies WC_COMPOSITECHECK, it can use the last three flags in this list (WC_DISCARDNS, WC_SEPCHARS, and WC_DEFAULTCHAR) to customize the conversion to precomposed characters. These flags determine the function's behavior when there is no precomposed mapping for a base/nonspacing character combination in a wide-character string. These last three flags can only be used if the WC_COMPOSITECHECK flag is set.

The function's default behavior is to generate separate characters (WC_SEPCHARS) for unmapped composite characters.

For the following code pages, dwFlags must be zero, otherwise the function fails with ERROR_INVALID_FLAGS:

50220, 50221, 50222, 50225, 50227, 50229, 52936, 54936, 57002 through 57011, 65000 (UTF-7), 65001 (UTF-8), 42 (Symbol)


Other parameters:

lpWideCharStr: Pointer to the Unicode string to be converted.

cchWideChar: Specifies the number of wide characters in the string pointed to by lpWideCharStr. If the value is -1, the string is assumed to be null-terminated and its length is calculated automatically.

lpMultiByteStr: Pointer to the buffer that receives the converted string.

cchMultiByte: Specifies the maximum size, in bytes, of the buffer pointed to by lpMultiByteStr. If the value is zero, the function returns the number of bytes required for the destination buffer. In this case, lpMultiByteStr is usually NULL.

lpDefaultChar and pfUsedDefaultChar: WideCharToMultiByte uses these two parameters only when it encounters a wide character that has no representation in the code page identified by the CodePage parameter. If a wide character cannot be converted, the function uses the character pointed to by lpDefaultChar. If this parameter is NULL (which is the usual case), the function uses the system default character. This default character is usually a question mark, which is dangerous for file names because the question mark is a wildcard. The pfUsedDefaultChar parameter points to a Boolean variable. If at least one character in the Unicode string cannot be converted to an equivalent multibyte character, the function sets the variable to TRUE. If all characters are converted successfully, the function sets it to FALSE. After the function returns, you can test this variable to check whether the wide-character string was converted without loss.

Return Value:
If the function succeeds and cchMultiByte is nonzero, the return value is the number of bytes written to the buffer pointed to by lpMultiByteStr. If the function succeeds and cchMultiByte is zero, the return value is the required size, in bytes, of the buffer that is to receive the converted string. If the function fails, the return value is zero. To get extended error information, call GetLastError, which can return the following error codes:

ERROR_INSUFFICIENT_BUFFER; ERROR_INVALID_FLAGS;

ERROR_INVALID_PARAMETER; ERROR_NO_UNICODE_TRANSLATION.

Note:
The pointers lpMultiByteStr and lpWideCharStr must be different. If they are the same, the function fails and GetLastError returns ERROR_INVALID_PARAMETER.

In addition, there are several other useful character-set functions:

1. GetTextCharsetInfo

Function:
Retrieves information about the character set of the font currently selected into the specified device context.

Function prototype

int GetTextCharsetInfo(
_In_ HDC hdc,
_Out_ LPFONTSIGNATURE lpSig,
_In_ DWORD dwFlags
);

Parameters

hdc
[in] Handle to the device context.

lpSig
[out] Optional. Pointer to a FONTSIGNATURE data structure that receives the font signature information.

If a TrueType font is currently selected into the device context, the structure receives information that identifies the code pages and Unicode subranges supported by the font.

If the font is not TrueType, the structure receives zeros. In this case, the application should use the TranslateCharsetInfo function to obtain generic font signature information for the character set.

If the application does not need the FONTSIGNATURE information, this parameter can be NULL. In that case, calling this function has the same effect as calling GetTextCharset.

dwFlags
[in] Reserved. Must be 0.

Return Value

If the function succeeds, the return value identifies the character set of the font currently selected into the specified device context. If the function fails, the return value is DEFAULT_CHARSET.

2. GetTextCharset
Function:

This function retrieves the character set identifier of the font currently selected into the specified device context. It is equivalent to GetTextCharsetInfo(hdc, NULL, 0).

Function prototype:
int GetTextCharset(HDC hdc);

Return Value:
If the function succeeds, the character set identifier is returned. If it fails, DEFAULT_CHARSET is returned.

3. BOOL IsDBCSLeadByte(BYTE TestChar);

This function uses the current Windows ANSI code page to determine whether the specified byte is a potential double-byte character set (DBCS) lead byte. To test against a different code page, use the IsDBCSLeadByteEx function.

4. BOOL IsDBCSLeadByteEx(UINT CodePage, BYTE TestChar);

This function determines whether the specified byte is a potential double-byte character set (DBCS) lead byte in the given code page.
The available CodePage values include CP_ACP, CP_MACCP, CP_OEMCP, and CP_THREAD_ACP.

5. BOOL IsTextUnicode(const VOID *pBuffer, int cb, LPINT lpi);

This function determines whether the specified buffer is likely to contain some form of Unicode text. It relies on heuristics rather than an exact test, so it may return incorrect results. Notepad uses this function, which is why it sometimes displays garbled characters. Use this function with caution.

Windows also has three related functions whose use is not recommended: NlsDllCodePageTranslation, UnicodeToBytes, and BytesToUnicode.

Of all these, MultiByteToWideChar and WideCharToMultiByte are the ones we focus on. For how to use these two functions in practice, see the next article: "Make your program more suitable-use ANSI and Unicode export functions".


Other articles in this series:

Multi-character set (ANSI) and Unicode and string processing method guidelines

Make your program more suitable-use ANSI and Unicode export functions