Character encoding (cont.)---Unicode and ANSI string conversions and resolved character encodings

Last Update:2015-09-20 Source: Internet

Author: User


Unicode and ANSI string conversions

We use the Windows function MultiByteToWideChar to convert multibyte strings to wide-character strings, as follows:

int MultiByteToWideChar(
    UINT uCodePage,
    DWORD dwFlags,
    PCSTR pMultiByteStr,
    int cbMultiByte,
    PWSTR pWideCharStr,
    int cchWideChar);

The uCodePage parameter identifies the code page associated with the multibyte string. The dwFlags parameter gives extra control over the conversion and affects characters with diacritical marks (such as accents). These flags are rarely needed, so 0 is usually passed. The pMultiByteStr parameter specifies the string to convert, and the cbMultiByte parameter specifies its length in bytes. If the value passed for cbMultiByte is -1, the function determines the length of the source string itself, which requires the string to be null-terminated. The function writes the converted Unicode version of the string to the buffer whose address is given by the pWideCharStr parameter. The maximum size of this buffer, in characters, must be specified in the cchWideChar parameter. If MultiByteToWideChar is called with 0 for cchWideChar, the function performs no conversion and instead returns the number of wide characters the buffer must be able to hold for the conversion to succeed (this count includes the terminating '\0' when cbMultiByte is -1). In general, follow these steps to convert a multibyte string to Unicode:

(1) Call MultiByteToWideChar, passing NULL for the pWideCharStr parameter, 0 for the cchWideChar parameter, and -1 for the cbMultiByte parameter.

(2) Allocate a piece of memory sufficient to accommodate the converted Unicode string. Its size is the return value of the previous MultiByteToWideChar call multiplied by sizeof (wchar_t).

(3) Call MultiByteToWideChar again, this time passing the buffer address as the value of the pWideCharStr parameter and the return value of the first MultiByteToWideChar call as the value of the cchWideChar parameter. This value is a character count, so it is not multiplied by sizeof (wchar_t) here.
(4) Use the converted string.
(5) Free the block of memory occupied by the Unicode string.
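
The steps above can be sketched as a small helper. This is a minimal sketch under stated assumptions rather than a definitive implementation: the helper name ansi_to_wide is hypothetical, CP_ACP (the system ANSI code page) is an assumed choice of code page, and the non-Windows branch, which uses the standard C function mbstowcs, is included only so that the sketch compiles outside Windows.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: converts a null-terminated multibyte string into a
   newly allocated wide string. Returns NULL on failure; the caller frees
   the result with free(). */
wchar_t *ansi_to_wide(const char *src)
{
    assert(src != NULL);
#ifdef _WIN32
    /* Step 1: ask for the required size in wide characters
       (terminator included, because the source length is -1). */
    int cch = MultiByteToWideChar(CP_ACP, 0, src, -1, NULL, 0);
    if (cch == 0) return NULL;

    /* Step 2: the allocation is in bytes, so multiply by sizeof(wchar_t). */
    wchar_t *dst = (wchar_t *)malloc((size_t)cch * sizeof(wchar_t));
    if (dst == NULL) return NULL;

    /* Step 3: convert; the buffer size is passed in characters. */
    if (MultiByteToWideChar(CP_ACP, 0, src, -1, dst, cch) == 0) {
        free(dst);
        return NULL;
    }
    return dst;
#else
    /* Portable stand-in (an assumption of this sketch): on POSIX systems,
       mbstowcs with a NULL destination returns the character count
       WITHOUT the terminator. */
    size_t n = mbstowcs(NULL, src, 0);
    if (n == (size_t)-1) return NULL;

    wchar_t *dst = (wchar_t *)malloc((n + 1) * sizeof(wchar_t));
    if (dst == NULL) return NULL;

    if (mbstowcs(dst, src, n + 1) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
#endif
}
```

A caller would write something like wchar_t *w = ansi_to_wide("hello");, use w (step 4), and then call free(w) (step 5).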
Correspondingly, the WideCharToMultiByte function converts a wide character string into a multibyte string, as follows:

int WideCharToMultiByte(
    UINT uCodePage,
    DWORD dwFlags,
    PCWSTR pWideCharStr,
    int cchWideChar,
    PSTR pMultiByteStr,
    int cbMultiByte,
    PCSTR pDefaultChar,
    PBOOL pfUsedDefaultChar);

This function is similar to the MultiByteToWideChar function. Again, uCodePage identifies the code page to associate with the newly converted string. The dwFlags parameter specifies additional conversion controls. These flags affect characters with diacritical marks and characters that the system cannot convert. This degree of control is rarely needed, so 0 is usually passed for dwFlags.

The pWideCharStr parameter specifies the memory address of the string to be converted, and the cchWideChar parameter indicates the length (number of characters) of the string. If you pass -1 for the cchWideChar parameter, the function determines the length of the source string itself.

The converted multibyte version of the string is written to the buffer specified by the pMultiByteStr parameter. The maximum size (in bytes) of this buffer must be specified in the cbMultiByte parameter. When you call WideCharToMultiByte with 0 as the value of cbMultiByte, the function returns the size required for the target buffer. Converting a wide-character string to a multibyte string follows the same steps as converting a multibyte string to a wide-character string. The only difference is that the return value is already the number of bytes required for a successful conversion, so no multiplication is needed.
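
The reverse direction can be sketched the same way. As before, this is only a sketch: the helper name wide_to_ansi is hypothetical, CP_ACP is an assumed choice of code page, and the non-Windows branch (standard C wcstombs) is there only so that the example compiles outside Windows.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: converts a null-terminated wide string into a newly
   allocated multibyte string. Returns NULL on failure; the caller frees
   the result with free(). */
char *wide_to_ansi(const wchar_t *src)
{
    assert(src != NULL);
#ifdef _WIN32
    /* First call: cbMultiByte == 0 returns the required size in bytes
       (terminator included, because the source length is -1). */
    int cb = WideCharToMultiByte(CP_ACP, 0, src, -1, NULL, 0, NULL, NULL);
    if (cb == 0) return NULL;

    /* The size is already in bytes, so no multiplication is needed. */
    char *dst = (char *)malloc((size_t)cb);
    if (dst == NULL) return NULL;

    /* Second call: perform the conversion into the allocated buffer. */
    if (WideCharToMultiByte(CP_ACP, 0, src, -1, dst, cb, NULL, NULL) == 0) {
        free(dst);
        return NULL;
    }
    return dst;
#else
    /* Portable stand-in (an assumption of this sketch): on POSIX systems,
       wcstombs with a NULL destination returns the byte count WITHOUT
       the terminator. */
    size_t n = wcstombs(NULL, src, 0);
    if (n == (size_t)-1) return NULL;

    char *dst = (char *)malloc(n + 1);
    if (dst == NULL) return NULL;

    if (wcstombs(dst, src, n + 1) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
#endif
}
```

Both trailing arguments are passed as NULL here, which is the common case.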

Note that the WideCharToMultiByte function accepts two more parameters than the MultiByteToWideChar function: pDefaultChar and pfUsedDefaultChar. The function uses these parameters only if a character has no corresponding representation in the code page specified by uCodePage. When it encounters a wide character that cannot be converted, the function uses the character pointed to by the pDefaultChar parameter. If this parameter is NULL (which is very common), the function uses a system default character, usually a question mark. This is dangerous for file names because the question mark is a wildcard character.

The pfUsedDefaultChar parameter points to a Boolean variable. If at least one character in the wide-character string cannot be converted to its multibyte form, the function sets the variable to TRUE. If all characters are converted successfully, the variable is set to FALSE. We can test the variable after the function returns to verify that the wide-character string was converted without loss. As with pDefaultChar, NULL is usually passed for this parameter.
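
The two extra parameters can be sketched as follows. This is a hedged illustration, not a definitive implementation: the helper name wide_to_ansi_checked and the replacement character '_' are assumptions of this example, and the non-Windows branch only emulates the substitution with the standard C function wctomb so that the sketch compiles elsewhere. Whether a given character is convertible depends on the code page in use.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: converts src into dst (cbDst bytes), substituting '_'
   for characters that cannot be converted. Returns the number of bytes
   written including the terminator, or 0 on failure. *usedDefault is set
   to 1 if a substitution happened, 0 otherwise. */
int wide_to_ansi_checked(const wchar_t *src, char *dst, int cbDst,
                         int *usedDefault)
{
    assert(src != NULL && dst != NULL && usedDefault != NULL);
#ifdef _WIN32
    BOOL fUsed = FALSE;
    /* Note: both trailing arguments must be NULL when the code page is
       CP_UTF8 or CP_UTF7; CP_ACP is assumed here. */
    int cb = WideCharToMultiByte(CP_ACP, 0, src, -1, dst, cbDst, "_", &fUsed);
    *usedDefault = fUsed ? 1 : 0;
    return cb;
#else
    /* Portable stand-in: wctomb returns -1 for an unconvertible character. */
    char tmp[MB_LEN_MAX];
    int out = 0;
    *usedDefault = 0;
    for (;; ++src) {
        int n;
        if (*src == L'\0') {
            tmp[0] = '\0';
            n = 1;
        } else if ((n = wctomb(tmp, *src)) < 0) {
            tmp[0] = '_';          /* substitute the default character */
            n = 1;
            *usedDefault = 1;
        }
        if (out + n > cbDst) return 0;   /* destination too small */
        memcpy(dst + out, tmp, (size_t)n);
        out += n;
        if (*src == L'\0') return out;
    }
#endif
}
```

Checking the flag after the call lets a program reject a lossy conversion, for example before using the result as a file name.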

Resolving character encoding Forms

The IsTextUnicode function can help determine whether a buffer contains ANSI or Unicode text. The function is exported by AdvApi32.dll and declared in WinBase.h:

BOOL IsTextUnicode(CONST VOID *pvBuffer, int cb, PINT pResult);

The IsTextUnicode function uses a series of statistical and deterministic methods to guess what the buffer contains, so its result can be wrong.

Its first parameter is pvBuffer, which identifies the address of the buffer to be tested. It is a void pointer because it is not yet known whether the buffer holds ANSI characters or Unicode characters.

The second parameter is cb, which specifies the number of bytes in the buffer that pvBuffer points to. Because the contents of the buffer are unknown, cb is a count of bytes rather than characters. Note that we do not have to specify the length of the entire buffer. Of course, the more bytes the function tests, the more accurate the result is likely to be.

The third parameter is pResult, which is the address of an integer that we must initialize before calling IsTextUnicode. The initial value of this integer indicates which tests IsTextUnicode should perform. You can also pass NULL, in which case IsTextUnicode performs every test it can.

If the IsTextUnicode function considers the buffer to contain Unicode text, it returns TRUE; otherwise, it returns FALSE. If specific tests were requested through the integer pointed to by pResult, the function also updates the corresponding bits of that integer before returning, to reflect the result of each test.
