Converting GBK (GB2312) to UTF-8

Source: Internet
Author: User

Recently, I wrote an Internet Explorer plug-in that takes text from a web page and encodes it into a URL. The previous article, "Chinese URL encoding", outlined the URL encoding rules and how Chinese text is URL-encoded, but it left open the question of how to convert GBK or GB2312 to UTF-8. Character encoding is a complicated subject and I know little about it, so these are only notes from my own experience. Additions and corrections are welcome.

In PHP and .NET, encoding conversion is fairly easy. ATL also has some macros for encoding conversion. I have never tried them, and I prefer the method described later.

In COM programming, strings are mostly stored as BSTR. Many articles online say that a BSTR holds Unicode, but my repeated attempts to convert it from Unicode to UTF-8 all failed. In the debugger, a BSTR containing a Chinese string displays correctly, which led me to treat its contents as GBK. (A BSTR does in fact hold UTF-16 wide characters; the GBK bytes examined below come from _com_util::ConvertBSTRToString, which converts to the system ANSI code page.)

How do we convert from GBK to UTF-8? libiconv should be able to do it. However, with its Windows port I could compile and register the COM component, but the component (the toolbar) then quit, so I gave up. Searching online turned up a widely reposted CChineseCode class. However, it handles only Chinese characters (each of which occupies 3 bytes in UTF-8). If the string also contains English text it runs into trouble, because an English letter occupies only one byte in UTF-8, and other characters can take two or four bytes. So this class does not apply.
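To make the variable-width point concrete, here is a minimal sketch of how many bytes UTF-8 needs per code point. The function name utf8_length is my own for illustration and is not part of the CChineseCode class or of any library.

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes UTF-8 needs to encode one Unicode code point.
// utf8_length is a hypothetical helper name used only for this sketch.
std::size_t utf8_length(char32_t cp) {
    assert(cp <= 0x10FFFF);      // anything larger is outside Unicode
    if (cp < 0x80)    return 1;  // ASCII, e.g. 'G'
    if (cp < 0x800)   return 2;  // e.g. U+00E9
    if (cp < 0x10000) return 3;  // most Chinese characters, e.g. U+7F16
    return 4;                    // supplementary planes
}
```

A converter that assumes every character takes 3 bytes is therefore only right for the third case, which is why mixed Chinese and English text breaks it.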

The correct method is to use the Win32 API functions MultiByteToWideChar and WideCharToMultiByte, where "wide character" means Unicode (UTF-16 on Windows). With this method there is no direct conversion between GBK and UTF-8; Unicode serves as the bridge. As an example, we will convert the string "编码 - Google 搜索" ("Encoding - Google Search").

Conversion from GBK to Unicode

The string is stored in a BSTR variable named in. First, convert it to an ordinary char string:

char *lpszText = _com_util::ConvertBSTRToString(in);

At this point, strlen(lpszText) returns 18: four Chinese characters at two bytes each, plus 10 other (ASCII) characters. So GBK/GB2312 is a multibyte encoding, not a wide-character one. In addition, lpszText[0] == 0xB1 and lpszText[1] == 0xE0 (viewed as unsigned bytes). Looking up 0xB1E0 in the Microsoft Windows code page 936 table gives "编", which reinforces our belief that the string is GBK.
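The byte count above can be checked without Windows. The sketch below assumes the sample string is "编码 - Google 搜索", which matches the 0xB1 0xE0 lead bytes and the 4 + 10 character count quoted above, and writes it out as raw GBK bytes. The names GbkCounts, count_gbk_chars and kSampleGbk are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct GbkCounts {
    std::size_t double_byte;  // characters stored as lead byte + trail byte
    std::size_t single_byte;  // everything else (ASCII here)
};

// Walks a NUL-terminated GBK string and counts characters by width.
// Assumes well-formed input: a lead byte (0x81-0xFE) is followed by a trail byte.
GbkCounts count_gbk_chars(const char* s) {
    assert(s != nullptr);
    GbkCounts counts = {0, 0};
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    while (*p != 0) {
        if (*p >= 0x81 && *p <= 0xFE && p[1] != 0) {
            ++counts.double_byte;
            p += 2;
        } else {
            ++counts.single_byte;
            p += 1;
        }
    }
    return counts;
}

// Assumed sample string as GBK bytes: B1E0 C2EB " - Google " CBD1 CBF7.
const char kSampleGbk[] = "\xB1\xE0\xC2\xEB - Google \xCB\xD1\xCB\xF7";
```

For this input, std::strlen(kSampleGbk) is 18, and count_gbk_chars reports 4 double-byte and 10 single-byte characters.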

The function used for conversion to Unicode is MultiByteToWideChar. Its first parameter is the code page of the multibyte string. If the text is known to be GBK, 936 can be used. However, I think the code page depends on the system (on a Japanese system it would be 932, for example), so I use CP_ACP, the system's current ANSI code page.

First, call the function with the cchWideChar parameter set to 0 to obtain the required buffer size, allocate the buffer, and then perform the actual conversion (when cbMultiByte is -1, the input string is treated as NUL-terminated). The code is as follows:

int wLen = MultiByteToWideChar(CP_ACP, 0, lpszText, -1, NULL, 0);
LPWSTR wStr = (LPWSTR)CoTaskMemAlloc(wLen * sizeof(WCHAR));
MultiByteToWideChar(CP_ACP, 0, lpszText, -1, wStr, wLen);

wLen is 15. Note that this is a count of wide characters: the 14 characters of the string plus the terminating NUL, which the function considerately includes. When allocating space, note also that 30 bytes are needed, not 15. All of this is described in MSDN; read the description of the cchWideChar parameter carefully. After the last line of code runs, wStr holds the Unicode form of the string. Checking shows wStr[0] == 0x7F16, and the Microsoft Windows code page 936 table confirms that 7F16 is the Unicode code point of "编", so everything is working as expected.
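The difference between "15 wide characters" and "30 bytes" can be sketched portably. The helper names and the constant below are my own, and the sample string is assumed to be "编码 - Google 搜索"; char16_t stands in for the 16-bit WCHAR used on Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-16 code units needed for `text`, plus one for the terminating NUL.
// This mirrors the count returned when cbMultiByte is -1.
std::size_t utf16_units_with_terminator(const std::u32string& text) {
    std::size_t units = 1;  // the terminator
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF);
        units += (cp >= 0x10000) ? 2 : 1;  // surrogate pair above the BMP
    }
    return units;
}

// Bytes to allocate for that many 16-bit code units.
std::size_t utf16_buffer_bytes(std::size_t units) {
    return units * sizeof(char16_t);
}

// Assumed sample string, written as code points.
const std::u32string kSampleCodePoints = U"\u7F16\u7801 - Google \u641C\u7D22";
```

For the sample string the unit count is 15 and the buffer size is 30 bytes, which is why the allocation multiplies wLen by sizeof(WCHAR).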

Conversion from Unicode to UTF-8

After the conversion to Unicode, the WideCharToMultiByte function converts the string to UTF-8; this time the code page is CP_UTF8. As with the previous conversion, first calculate and allocate the required space, and then perform the actual conversion.

int aLen = WideCharToMultiByte(CP_UTF8, 0, wStr, -1, NULL, 0, NULL, NULL);
char* converted = (char*)CoTaskMemAlloc(aLen);
WideCharToMultiByte(CP_UTF8, 0, wStr, -1, converted, aLen, NULL, NULL);

aLen is 23: the four Chinese characters take 3 bytes each, the 10 English characters take 1 byte each, and the terminating '\0' adds one more, for exactly 23. converted now holds the UTF-8 encoding of the string "编码 - Google 搜索". converted[0] == 0xE7 and converted[1] == 0xBC, which are the first two bytes of the UTF-8 encoding of "编".
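For readers who want to verify these numbers by hand, the following is a rough, portable sketch of the UTF-8 encoding rule itself. It is not what WideCharToMultiByte does internally, only an illustration; encode_utf8 is a hypothetical name, and the sample string is assumed to be "编码 - Google 搜索".

```cpp
#include <cassert>
#include <string>

// Encodes Unicode code points as UTF-8 bytes (illustrative helper).
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        // Reject values outside Unicode and lone surrogates.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Calling encode_utf8(U"\u7F16\u7801 - Google \u641C\u7D22") yields 22 bytes (23 with a terminator), and the first three bytes are 0xE7 0xBC 0x96.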

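Now we finally have the UTF-8 byte sequence of the mixed Chinese and English string, and can apply URL encoding (percent-encoding) to it. When finished, release wStr and converted with CoTaskMemFree, and lpszText with delete[].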
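The percent-encoding step is not shown in this article, so here is a minimal sketch of one way to do it, assuming the rule that unreserved characters (letters, digits, '-', '.', '_', '~') pass through and every other byte becomes %XX. The name percent_encode is hypothetical and this is not the plug-in's actual code.

```cpp
#include <cassert>
#include <string>

// Percent-encodes a byte string: unreserved characters are copied as-is,
// every other byte becomes '%' followed by two uppercase hex digits.
std::string percent_encode(const std::string& bytes) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        const bool unreserved =
            (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') ||
            (b >= '0' && b <= '9') ||
            b == '-' || b == '.' || b == '_' || b == '~';
        if (unreserved) {
            out += static_cast<char>(b);
        } else {
            out += '%';
            out += kHex[b >> 4];
            out += kHex[b & 0x0F];
        }
    }
    assert(out.size() >= bytes.size());  // each byte yields 1 or 3 characters
    return out;
}
```

For example, the three UTF-8 bytes of "编" (0xE7 0xBC 0x96) become "%E7%BC%96", and a space becomes "%20". Some URL contexts encode a space as '+' instead; that variant is not handled here.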

If you have read the code of the CChineseCode class, you may wonder why it needs a hand-written UnicodeToUTF_8 function, given that its author already knows how to use the Win32 conversion functions to convert GB2312 to Unicode.

 
