VC++: Processing UTF-8 Encoded Strings

Source: Internet
Author: User

Open Notepad on Windows and save a file; the Save dialog offers the four encoding choices below. ANSI is the multibyte character set of the system's active code page, and corresponds to char strings in VC. Unicode is UTF-16 (little endian) and corresponds to WCHAR (wchar_t) strings in VC. Unicode big endian is UTF-16 with big-endian byte order and is rarely used. UTF-8 is used by almost all web pages. It encodes every character in 1 to 4 bytes: English letters need only 1 byte, most Chinese characters need 3 bytes, and rarer characters outside the Basic Multilingual Plane need 4. Because most of the text transmitted over the network is English, UTF-8 saves bandwidth compared with UTF-16, which uses at least 2 bytes per character and 4 bytes for some characters.
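The byte counts above follow directly from how UTF-8 packs a code point into bytes. The following is a minimal, portable sketch of that packing; the function name EncodeUtf8 is hypothetical and not part of the original article, and the code assumes it is given a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::string EncodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII, e.g. English letters
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: most Chinese characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: supplementary planes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, EncodeUtf8(0x41) ('A') yields 1 byte, EncodeUtf8(0x4E2D) (the character 中) yields the 3 bytes E4 B8 AD, and EncodeUtf8(0x20000) yields 4 bytes.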

Suppose we write a VC program that fetches HTML page data encoded in UTF-8. Once the data is in a char array in our program, English text displays normally but Chinese text is garbled, because the program interprets char strings as ANSI. There are generally three ways to convert UTF-8 to ANSI. The first is to implement the conversion by hand: a Baidu search turns up plenty of material, and once you understand these encodings thoroughly you can write the conversion yourself or reuse one of the many conversion functions published online. The second is to use a third-party library. The third, since we are programming on Windows, is to use the API functions MultiByteToWideChar and WideCharToMultiByte. With these functions the conversion takes two steps. First, use MultiByteToWideChar to convert the UTF-8 encoded char string into a WCHAR string; the first parameter names the code page of the input, CP_UTF8, meaning UTF-8. Then use WideCharToMultiByte to convert the WCHAR string into a char string; the first parameter is 936, the code page for Simplified Chinese. For background on code pages, see Baidu Baike.
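To illustrate the manual approach mentioned above, here is a minimal sketch of the first step, decoding UTF-8 bytes into UTF-16 code units, written in portable C++ without the Windows API. The name Utf8ToUtf16 is hypothetical. The sketch rejects truncated sequences and invalid lead or continuation bytes, but it does not reject overlong encodings or encoded surrogates, which a production decoder should.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Returns false on truncated or malformed input.
bool Utf8ToUtf16(const std::string &utf8, std::u16string &out)
{
    out.clear();
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t extra;                       // number of continuation bytes
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;                       // stray continuation or invalid lead byte
        if (utf8.size() - i <= extra) return false;   // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) return false;     // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF) return false;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {                                 // encode as a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return true;
}
```

On Windows, MultiByteToWideChar with CP_UTF8 performs this step for you; the sketch only shows what that conversion involves.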

The two functions I wrote for converting between ANSI and UTF-8 are posted below. The parameter is an MFC CString, which here must hold char data (a project built with the multibyte character set). To pass in a C-style character array instead, modify the functions slightly.

// UTF-8 to ANSI
void UTF8toANSI(CString &strUTF8)
{
	// Get the buffer size needed for the wide-character string, then create the wide-character buffer
	UINT nLen = MultiByteToWideChar(CP_UTF8, 0, strUTF8, -1, NULL, 0);
	WCHAR *wszBuffer = new WCHAR[nLen + 1];
	nLen = MultiByteToWideChar(CP_UTF8, 0, strUTF8, -1, wszBuffer, nLen);
	wszBuffer[nLen] = 0;

	// Get the buffer size needed for the ANSI string, then create the multibyte buffer; 936 is the Simplified Chinese code page
	nLen = WideCharToMultiByte(936, 0, wszBuffer, -1, NULL, 0, NULL, NULL);
	CHAR *szBuffer = new CHAR[nLen + 1];
	nLen = WideCharToMultiByte(936, 0, wszBuffer, -1, szBuffer, nLen, NULL, NULL);
	szBuffer[nLen] = 0;

	strUTF8 = szBuffer;
	// Clean up memory
	delete[] szBuffer;
	delete[] wszBuffer;
}
// ANSI to UTF-8
void ANSItoUTF8(CString &strAnsi)
{
	// Get the buffer size needed for the wide-character string, then create the wide-character buffer; 936 is the Simplified Chinese code page
	UINT nLen = MultiByteToWideChar(936, 0, strAnsi, -1, NULL, 0);
	WCHAR *wszBuffer = new WCHAR[nLen + 1];
	nLen = MultiByteToWideChar(936, 0, strAnsi, -1, wszBuffer, nLen);
	wszBuffer[nLen] = 0;
	// Get the buffer size needed for the UTF-8 string, then create the multibyte buffer
	nLen = WideCharToMultiByte(CP_UTF8, 0, wszBuffer, -1, NULL, 0, NULL, NULL);
	CHAR *szBuffer = new CHAR[nLen + 1];
	nLen = WideCharToMultiByte(CP_UTF8, 0, wszBuffer, -1, szBuffer, nLen, NULL, NULL);
	szBuffer[nLen] = 0;

	strAnsi = szBuffer;
	// Clean up memory
	delete[] wszBuffer;
	delete[] szBuffer;
}


Note that a UTF-8 encoded string is normally stored in a char array, not in a WCHAR (wchar_t) array. Why? UTF-8 is a sequence of 8-bit code units: each character takes 1 to 4 bytes, and some take only 1, so a byte-sized char is the natural storage unit. A WCHAR is two bytes on Windows and holds a UTF-16 code unit. If UTF-8 bytes were stored one per WCHAR, the wide-character APIs would treat each byte as a UTF-16 code unit and misinterpret the text.
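One consequence of storing UTF-8 in a char array is that the array length counts bytes, not characters. The sketch below, under the assumption that the input is well-formed UTF-8, counts characters by skipping continuation bytes; CountCodePoints is a hypothetical name used only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the characters (code points) in a UTF-8 string.
// Continuation bytes always have the bit pattern 10xxxxxx, so every byte
// that does not match that pattern starts a new character.
// Assumes the input is well-formed UTF-8.
std::size_t CountCodePoints(const std::string &utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "A" followed by 中 (bytes 41 E4 B8 AD), the char array holds 4 bytes while CountCodePoints reports 2 characters.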
