Unicode and Character Set functions (source: Network)


The first function converts wide-character strings to multi-byte strings. The function prototype is as follows:

int WideCharToMultiByte(
    UINT CodePage,
    DWORD dwFlags,
    LPCWSTR lpWideCharStr,
    int cchWideChar,
    LPSTR lpMultiByteStr,
    int cbMultiByte,
    LPCSTR lpDefaultChar,
    LPBOOL lpUsedDefaultChar
);

This function converts a wide-character (UTF-16) string to a new string in the specified code page, such as ANSI or UTF-8. The new string does not have to use a multi-byte character set.

Parameters:

CodePage: the code page to convert to. It can be any installed or available code page, or one of the following values.

CP_ACP: the current system ANSI code page
CP_MACCP: the current system Macintosh code page
CP_OEMCP: the current system OEM (original equipment manufacturer) code page
CP_SYMBOL: the symbol code page, available in Windows 2000 and later. I am not sure what it is used for (not supported on Windows Mobile 5.0)
CP_THREAD_ACP: the ANSI code page of the current thread, available in Windows 2000 and later. I do not fully understand this one either
CP_UTF7: UTF-7. Both lpDefaultChar and lpUsedDefaultChar must be NULL when this value is used
CP_UTF8: UTF-8. Both lpDefaultChar and lpUsedDefaultChar must be NULL when this value is used

I think CP_ACP and CP_UTF8 are the most commonly used: the former converts wide characters to ANSI, the latter to UTF-8.

dwFlags: specifies how to handle characters that cannot be converted directly. The function runs faster when no flags are set; I set it to 0. The available values are:

WC_NO_BEST_FIT_CHARS: converts any Unicode character that has no direct multi-byte equivalent to the default character specified by lpDefaultChar. In other words, if you convert Unicode to multi-byte and then back again, you do not necessarily get the same Unicode character, because the default character may have been substituted along the way. This option can be used alone or with other options.

WC_COMPOSITECHECK: converts composite characters to precomposed characters. It can be combined with any one of the last three options below. If it is not combined with any of them, it behaves as if WC_SEPCHARS were specified.

WC_ERR_INVALID_CHARS: makes the function fail when it encounters an invalid character, with GetLastError returning the error code ERROR_NO_UNICODE_TRANSLATION. Without this flag, the function silently discards invalid characters. This option can only be used with UTF-8.

WC_DISCARDNS: discards nonspacing characters during conversion. Used together with WC_COMPOSITECHECK.
WC_SEPCHARS: generates separate characters during conversion. This is the default conversion behavior. Used together with WC_COMPOSITECHECK.
WC_DEFAULTCHAR: replaces exception characters with the default character (most commonly '?') during conversion. Used together with WC_COMPOSITECHECK.

When WC_COMPOSITECHECK is specified, the function converts composite characters to precomposed characters. A composite character consists of a base character and a nonspacing character (such as the accent marks in European languages or the tone marks in Chinese pinyin), each with its own character value. A precomposed character has a single character value that represents the combination of a base character and a nonspacing character.
When you specify WC_COMPOSITECHECK, you can also use one of the last three options listed above to customize the conversion rules. These options determine how the function behaves when a composite sequence in the wide string has no precomposed equivalent. If none of them is specified, the function defaults to WC_SEPCHARS.
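To make the terminology concrete, here is a minimal, portable C sketch of what folding a base character and a nonspacing mark into one precomposed character means. It does not call the Win32 API. The function name compose_acute and its five-entry table are hypothetical and cover only lowercase vowels followed by U+0301 (combining acute accent); the real conversion handles far more cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold "base + U+0301" pairs into precomposed characters.
   Writes to out (which must hold at least n units) and returns the new length. */
size_t compose_acute(const uint16_t *in, size_t n, uint16_t *out)
{
    static const uint16_t base[]        = { 'a', 'e', 'i', 'o', 'u' };
    static const uint16_t precomposed[] = { 0x00E1, 0x00E9, 0x00ED, 0x00F3, 0x00FA };
    size_t w = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint16_t c = in[i];
        if (i + 1 < n && in[i + 1] == 0x0301) {
            for (size_t k = 0; k < sizeof base / sizeof base[0]; k++) {
                if (base[k] == c) {
                    c = precomposed[k];
                    i++;            /* consume the nonspacing mark */
                    break;
                }
            }
        }
        out[w++] = c;
    }
    return w;
}
```

A pair with no entry in the table is copied through as two separate units, which roughly mirrors the idea behind the default "separate characters" behavior described above.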

dwFlags must be 0 for the following code pages; otherwise, the function fails with the error code ERROR_INVALID_FLAGS:
50220, 50221, 50222, 50225, 50227, 50229, 52936, 54936, 57002 through 57011, 65000 (UTF-7), 42 (Symbol)
For UTF-8, dwFlags must be 0 or WC_ERR_INVALID_CHARS. Otherwise, the function fails and sets the error code ERROR_INVALID_FLAGS, which you can retrieve by calling GetLastError.
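The flag restrictions above can be sketched as a small pre-check. This is a portable illustration, not part of the Win32 API: the function name wc_flags_allowed is hypothetical, the constants repeat the Windows SDK values (42, 65000, 65001, 0x80) so that it builds without windows.h, and it assumes the code page list reads as 50220, 50221, 50222, 50225, 50227, 50229, 52936, 54936, 57002 through 57011, 65000 (UTF-7), and 42 (Symbol).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values as defined in the Windows SDK headers, repeated here so the sketch
   compiles without <windows.h>. */
#define SKETCH_CP_SYMBOL            42u
#define SKETCH_CP_UTF7              65000u
#define SKETCH_CP_UTF8              65001u
#define SKETCH_WC_ERR_INVALID_CHARS 0x00000080u

/* Hypothetical pre-check: does dwFlags satisfy the per-code-page rule? */
bool wc_flags_allowed(uint32_t code_page, uint32_t flags)
{
    static const uint32_t zero_only[] = {
        50220, 50221, 50222, 50225, 50227, 50229, 52936, 54936,
        SKETCH_CP_UTF7, SKETCH_CP_SYMBOL
    };

    if (code_page == SKETCH_CP_UTF8)
        return flags == 0 || flags == SKETCH_WC_ERR_INVALID_CHARS;
    if (code_page >= 57002 && code_page <= 57011)
        return flags == 0;
    for (size_t i = 0; i < sizeof zero_only / sizeof zero_only[0]; i++)
        if (zero_only[i] == code_page)
            return flags == 0;
    return true;   /* other code pages are not restricted by this rule */
}
```

A check like this only mirrors the documented rule; the authoritative answer is still the return value of the conversion function and GetLastError.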

lpWideCharStr: the wide string to convert.

cchWideChar: the length, in characters, of the string to convert. A value of -1 means the string is null-terminated and is converted up to and including the terminating null.

lpMultiByteStr: the output buffer that receives the converted string.

cbMultiByte: the size of the output buffer, in bytes. If it is 0, lpMultiByteStr is ignored and the function returns the required buffer size without writing any output.

lpDefaultChar: pointer to the character to use when a character cannot be represented in the specified code page. If it is NULL, the system default character is used. For code pages that require this parameter to be NULL (CP_UTF7 and CP_UTF8), passing a non-NULL value makes the function fail with the error code ERROR_INVALID_PARAMETER.

lpUsedDefaultChar: pointer to a flag that indicates whether the default character was used during the conversion. For code pages that require this parameter to be NULL (CP_UTF7 and CP_UTF8), passing a non-NULL value makes the function fail with the error code ERROR_INVALID_PARAMETER. The function is faster when both lpDefaultChar and lpUsedDefaultChar are NULL.

Return value: if the function succeeds and cbMultiByte is not 0, it returns the number of bytes written to lpMultiByteStr (including the terminating null when the input includes one). If cbMultiByte is 0, it returns the number of bytes required for the conversion. If the function fails, it returns 0.
Note: improper use of WideCharToMultiByte can compromise program security. Calling this function can easily cause a buffer overrun, because the size of the input buffer pointed to by lpWideCharStr is counted in wide characters, while the size of the output buffer pointed to by lpMultiByteStr is counted in bytes. To avoid an overrun, make sure to specify an appropriate size for the output buffer. My method is to call WideCharToMultiByte once with cbMultiByte set to 0 to get the required buffer size, allocate the buffer, and then call WideCharToMultiByte again to fill it. For details, see the following code. In addition, converting from Unicode (UTF-16) to a non-Unicode character set can lose data, because the target character set may have no character that represents a given Unicode character.

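#include <windows.h>
#include <stdlib.h>

const wchar_t *pwszUnicode = L"Hello, world! Hello, China!";
int iSize;
char *pszMultiByte;

/* First call: cbMultiByte is 0, so the return value is the required size in bytes (including the terminating null). */
iSize = WideCharToMultiByte(CP_ACP, 0, pwszUnicode, -1, NULL, 0, NULL, NULL);
pszMultiByte = (char *)malloc((size_t)iSize);
/* Second call: fill the buffer. */
if (iSize > 0 && pszMultiByte != NULL)
    WideCharToMultiByte(CP_ACP, 0, pwszUnicode, -1, pszMultiByte, iSize, NULL, NULL);
/* Call free(pszMultiByte) when the string is no longer needed. */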

The second function converts multi-byte strings to wide-character strings. The function prototype is as follows:
int MultiByteToWideChar(
    UINT CodePage,
    DWORD dwFlags,
    LPCSTR lpMultiByteStr,
    int cbMultiByte,
    LPWSTR lpWideCharStr,
    int cchWideChar
);

This function converts a multi-byte string to a wide (Unicode) string. The string to be converted does not have to be multi-byte.

For the parameters, return value, and precautions, see the WideCharToMultiByte description above. Here I only briefly describe dwFlags.

dwFlags: specifies whether to convert to precomposed or composite wide characters, whether to use glyph characters in place of control characters, and how to handle invalid characters.

MB_PRECOMPOSED: always use precomposed characters, that is, a single precomposed character in place of a base character followed by a nonspacing character. This is the default option and cannot be combined with MB_COMPOSITE.
MB_COMPOSITE: always use composite characters, that is, a base character followed by a nonspacing character.
MB_ERR_INVALID_CHARS: if the function encounters an invalid character, it fails with the error code ERROR_NO_UNICODE_TRANSLATION. Without this flag, invalid characters are discarded.
MB_USEGLYPHCHARS: use glyph characters instead of control characters.

dwFlags must be 0 for the following code pages; otherwise, the function fails with the error code ERROR_INVALID_FLAGS:
50220, 50221, 50222, 50225, 50227, 50229, 52936, 54936, 57002 through 57011, 65000 (UTF-7), 42 (Symbol)
For UTF-8, dwFlags must be 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with the error code ERROR_INVALID_FLAGS.

I have never used the following functions, so I only describe them briefly.

int GetTextCharset(HDC hdc);

This function returns the character set of the font currently selected into the specified device context. It is equivalent to calling GetTextCharsetInfo(hdc, NULL, 0).

Return value: the character set identifier on success. If the function fails, the return value is DEFAULT_CHARSET.

The character set identifiers defined by Windows include:
ANSI_CHARSET
BALTIC_CHARSET
CHINESEBIG5_CHARSET
DEFAULT_CHARSET
EASTEUROPE_CHARSET
GB2312_CHARSET
GREEK_CHARSET
HANGUL_CHARSET
MAC_CHARSET
OEM_CHARSET
RUSSIAN_CHARSET
SHIFTJIS_CHARSET
SYMBOL_CHARSET
TURKISH_CHARSET
VIETNAMESE_CHARSET

JOHAB_CHARSET (Korean edition of Windows)
ARABIC_CHARSET (Middle East edition of Windows)
HEBREW_CHARSET (Middle East edition of Windows)
THAI_CHARSET (Thai edition of Windows)

int GetTextCharsetInfo(
    HDC hdc,
    LPFONTSIGNATURE lpSig,
    DWORD dwFlags
);

This function retrieves character set information about the font currently selected into the specified device context.

Parameters:
lpSig: pointer to a FONTSIGNATURE structure. If the selected font is a TrueType font, the function fills in the FONTSIGNATURE structure. Otherwise, the structure is filled with zeros, and TranslateCharsetInfo should be used to obtain general character set information. If you do not need the font signature information, you can set this parameter to NULL, which is equivalent to calling GetTextCharset.

The FONTSIGNATURE structure holds code page and Unicode subrange information, that is, the code pages and Unicode subranges for which the font provides glyphs. The structure is defined as follows:
typedef struct tagFONTSIGNATURE {
    DWORD fsUsb[4];
    DWORD fsCsb[2];
} FONTSIGNATURE, *PFONTSIGNATURE;
fsUsb is a 128-bit Unicode subset bitfield (USB) that identifies up to 126 Unicode subranges. Apart from the two most significant bits, each bit represents one subrange. The most significant bit is always 1 and marks the bitfield as a font signature; the next most significant bit is reserved and must be 0. The Unicode subranges are numbered in accordance with the ISO 10646 standard.
fsCsb is a 64-bit code page bitfield (CPB) that identifies a character set or code page. The low-order 32 bits identify Windows code pages, while the high-order 32 bits are used for non-Windows code pages.
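Testing individual bits of these two bitfields can be sketched as follows. This is a portable illustration: FontSigSketch is a hypothetical stand-in with the same member layout as FONTSIGNATURE so that the code builds without windows.h, the two helper names are made up, and the sketch assumes bit 0 is the least significant bit of the first DWORD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in with the same layout as FONTSIGNATURE (assumption: DWORD is 32 bits). */
typedef struct {
    uint32_t fsUsb[4];  /* 128-bit Unicode subrange bitfield */
    uint32_t fsCsb[2];  /* 64-bit code page bitfield */
} FontSigSketch;

/* Bit n lives in DWORD n / 32, at position n % 32. */
bool has_unicode_subrange(const FontSigSketch *sig, unsigned bit)
{
    assert(sig != NULL && bit < 128);
    return ((sig->fsUsb[bit / 32] >> (bit % 32)) & 1u) != 0;
}

bool has_code_page_bit(const FontSigSketch *sig, unsigned bit)
{
    assert(sig != NULL && bit < 64);
    return ((sig->fsCsb[bit / 32] >> (bit % 32)) & 1u) != 0;
}
```

With a real FONTSIGNATURE filled in by GetTextCharsetInfo, the same index arithmetic would apply to its fsUsb and fsCsb members.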

dwFlags: reserved. It must be 0.
Return value: the character set identifier on success. If the function fails, the return value is DEFAULT_CHARSET.

BOOL TranslateCharsetInfo(
    DWORD *lpSrc,
    LPCHARSETINFO lpCs,
    DWORD dwFlags
);

This function translates character set information based on the specified character set, code page, or font signature, and fills in all members of the destination structure.
Parameters:
lpSrc: if dwFlags is TCI_SRCFONTSIG, this parameter is the address of the fsCsb member of a FONTSIGNATURE structure; otherwise, it is a DWORD value.
lpCs: pointer to the CHARSETINFO structure to be filled in.

The CHARSETINFO structure is defined as follows:
typedef struct tagCHARSETINFO {
    UINT ciCharset; // character set identifier
    UINT ciACP; // ANSI code page identifier
    FONTSIGNATURE fs;
} CHARSETINFO, *PCHARSETINFO;

dwFlags: specifies how to perform the translation. It can be one of the following values:
TCI_SRCCHARSET: lpSrc holds a character set identifier in its low word; the high word is 0.
TCI_SRCCODEPAGE: lpSrc holds a code page identifier in its low word; the high word is 0.
TCI_SRCFONTSIG: lpSrc is the code page bitfield of a FONTSIGNATURE structure, that is, the address of its fsCsb member.
TCI_SRCLOCALE: lpSrc is the LCID or LANGID of the keyboard layout. If it is a LANGID, the value is in the low word of lpSrc.

BOOL IsDBCSLeadByte(BYTE TestChar);

This function uses the current Windows ANSI code page to determine whether the specified byte is a potential double-byte character set (DBCS) lead byte. To test against a different code page, use the IsDBCSLeadByteEx function.

BOOL IsDBCSLeadByteEx(
    UINT CodePage,
    BYTE TestChar
);

This function determines whether the specified byte is a potential double-byte character set (DBCS) lead byte in the given code page.
The available CodePage values include CP_ACP, CP_MACCP, CP_OEMCP, and CP_THREAD_ACP.
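A lead-byte test is typically used to step through a DBCS string one character at a time. The portable sketch below shows the idea without calling the Win32 API: is_lead_byte_936 is a hypothetical stand-in for IsDBCSLeadByteEx(936, b) that assumes the code page 936 (GBK) lead-byte range 0x81 to 0xFE, and dbcs_char_count is a made-up helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByteEx(936, b): in code page 936 (GBK)
   a lead byte lies in the range 0x81..0xFE. */
static bool is_lead_byte_936(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Count characters (not bytes) in a null-terminated code page 936 string. */
size_t dbcs_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        /* A lead byte followed by another byte forms one two-byte character. */
        if (is_lead_byte_936((unsigned char)*s) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        count++;
    }
    return count;
}
```

The word "potential" matters here: a byte value in the lead-byte range can also appear as a trail byte, so the test is only meaningful when the string is scanned from a known character boundary, as this loop does.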

BOOL IsTextUnicode(
    const VOID *pBuffer,
    int cb,
    LPINT lpi
);

This function determines whether the specified buffer is likely to contain Unicode text in some form.
Parameters:
lpi: on input, specifies which tests to apply. On output, each bit holds the result of the corresponding test: 1 if the test passed, 0 if it failed. If lpi is NULL, the function applies all available tests. It can be one or more of the following values:
IS_TEXT_UNICODE_ASCII16: the text is Unicode and contains only zero-extended ASCII characters.
IS_TEXT_UNICODE_REVERSE_ASCII16: same as above, but the Unicode text is byte-reversed (related to the BOM).
IS_TEXT_UNICODE_STATISTICS: the text is probably Unicode. Because the decision is based on statistical analysis, it is not guaranteed to be correct.
IS_TEXT_UNICODE_REVERSE_STATISTICS: same as above, but the Unicode text may be byte-reversed.
IS_TEXT_UNICODE_CONTROLS: the text contains one or more Unicode representations of nonprinting control characters, such as RETURN, LINEFEED, SPACE, CJK_SPACE, and TAB.
IS_TEXT_UNICODE_REVERSE_CONTROLS: same as above, but the Unicode characters are byte-reversed.
IS_TEXT_UNICODE_BUFFER_TOO_SMALL: there is too little text in the buffer for meaningful analysis.
IS_TEXT_UNICODE_SIGNATURE: the text contains the Unicode byte-order mark (BOM) 0xFEFF as its first character.
IS_TEXT_UNICODE_REVERSE_SIGNATURE: the text contains the byte-reversed Unicode byte-order mark 0xFFFE as its first character.
IS_TEXT_UNICODE_ILLEGAL_CHARS: the text contains characters that are illegal in Unicode, such as an embedded BOM, UNICODE_NUL, CRLF, or 0xFFFF.
IS_TEXT_UNICODE_ODD_LENGTH: the number of bytes in the buffer is odd, whereas a Unicode string always has an even byte length.
IS_TEXT_UNICODE_NULL_BYTES: the buffer contains null bytes, which indicates non-ASCII text.
IS_TEXT_UNICODE_UNICODE_MASK: IS_TEXT_UNICODE_ASCII16 | IS_TEXT_UNICODE_STATISTICS | IS_TEXT_UNICODE_CONTROLS | IS_TEXT_UNICODE_SIGNATURE
IS_TEXT_UNICODE_REVERSE_MASK: IS_TEXT_UNICODE_REVERSE_ASCII16 | IS_TEXT_UNICODE_REVERSE_STATISTICS | IS_TEXT_UNICODE_REVERSE_CONTROLS | IS_TEXT_UNICODE_REVERSE_SIGNATURE
IS_TEXT_UNICODE_NOT_UNICODE_MASK: IS_TEXT_UNICODE_ILLEGAL_CHARS | IS_TEXT_UNICODE_ODD_LENGTH, plus two currently unused bit flags
IS_TEXT_UNICODE_NOT_ASCII_MASK: IS_TEXT_UNICODE_NULL_BYTES, plus three currently unused bit flags
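The two signature tests are the easiest to picture. The portable sketch below checks the first two bytes of a buffer by hand; it does not call IsTextUnicode, and the function name utf16_signature and its return convention are hypothetical. It assumes a little-endian machine such as Windows on x86, where the bytes FF FE read back as the 16-bit value 0xFEFF.

```c
#include <assert.h>
#include <stddef.h>

/* Returns  1 if the buffer starts with FF FE (0xFEFF when read little-endian),
           -1 if it starts with FE FF (0xFFFE, the byte-reversed signature),
            0 otherwise, including buffers shorter than two bytes. */
int utf16_signature(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return 0;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return 1;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return -1;
    return 0;
}
```

A return value of 1 corresponds to what IS_TEXT_UNICODE_SIGNATURE looks for and -1 to IS_TEXT_UNICODE_REVERSE_SIGNATURE; the statistical and control-character tests have no such simple equivalent.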

There are also three functions that Microsoft does not recommend using:
NlsDllCodePageTranslation, UnicodeToBytes, and BytesToUnicode. They are not described here.

P.S. The following link error often occurs when code that defines _UNICODE is compiled in Visual C++:

msvcrtd.lib(crtexew.obj) : error LNK2001: unresolved external symbol _WinMain@16

Debug/test.exe : fatal error LNK1120: 1 unresolved externals

The solution is as follows:

Under Project -> Settings -> Link, select the Output category and enter wWinMainCRTStartup as the entry-point symbol.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
