Open Notepad on Windows and save a file; the Save dialog offers the four encoding choices below. ANSI is the multibyte character set of the system's current code page, and corresponds to the char string in VC. Unicode is UTF-16 (little endian), and corresponds to the WCHAR (wchar_t) string in VC. Unicode big endian is UTF-16 with big-endian byte order; this encoding is used relatively rarely. UTF-8 is the encoding used by most web pages. It encodes every character in 1 to 4 bytes: English letters need only 1 byte, while Chinese characters need 3 or 4 bytes. Because most of the characters transmitted over the network are English, UTF-8 saves bandwidth compared with UTF-16, which uses at least 2 bytes per character and 4 bytes for some characters.
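The byte counts above follow directly from the code point ranges of each encoding. Here is a minimal sketch in portable C++ that returns how many bytes UTF-8 and UTF-16 need for a single Unicode code point. The function names Utf8Length and Utf16Length are made up for this illustration and are not part of any library.

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes UTF-8 uses for one code point.
int Utf8Length(std::uint32_t codePoint)
{
    if (codePoint < 0x80)    return 1;  // ASCII, including English letters
    if (codePoint < 0x800)   return 2;
    if (codePoint < 0x10000) return 3;  // most Chinese characters
    return 4;                           // supplementary planes, e.g. rare Chinese characters
}

// Hypothetical helper: number of bytes UTF-16 uses for one code point.
int Utf16Length(std::uint32_t codePoint)
{
    // One 16-bit unit for the Basic Multilingual Plane, otherwise a surrogate pair.
    return codePoint < 0x10000 ? 2 : 4;
}
```

For example, the letter A (U+0041) takes 1 byte in UTF-8 and 2 in UTF-16, while the Chinese character U+4E2D takes 3 bytes in UTF-8 and 2 in UTF-16.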
Suppose we write a VC program that fetches HTML web page data encoded in UTF-8. Once the data is stored in a char array in our program, the English text displays normally but the Chinese text is all garbled, because the program treats char strings as ANSI. There are generally two ways to convert UTF-8 to ANSI. One is to implement the conversion by hand: a Baidu search turns up plenty of material, and once you understand these character encodings thoroughly you can write the conversion yourself; there are also many ready-made conversion functions online. The other is to use a third-party library. Since we are writing for the Windows platform, however, we can simply use the API functions MultiByteToWideChar and WideCharToMultiByte. With these functions the conversion takes two steps. First, use MultiByteToWideChar to convert the UTF-8 encoded char string into a WCHAR string; the first parameter specifies the code page of the input string, which here is CP_UTF8, meaning UTF-8. Then use WideCharToMultiByte to convert the WCHAR string into a char string; here the first parameter is 936, the code page for Simplified Chinese. For more about code pages, see Baidu Baike.
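To give an idea of what the manual approach involves, here is a minimal sketch of its first half: decoding UTF-8 bytes into Unicode code points. The name DecodeUtf8 is hypothetical, and the sketch assumes well-formed input (it stops at the first invalid or truncated sequence and does not reject overlong forms). Mapping the resulting code points to code page 936 would additionally require a lookup table, which is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
// Decoding stops at the first invalid lead byte, bad continuation byte,
// or truncated sequence.
std::vector<std::uint32_t> DecodeUtf8(const std::string& bytes)
{
    std::vector<std::uint32_t> codePoints;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes that follow the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else break;                             // not a valid lead byte
        if (i + extra >= bytes.size()) break;   // sequence is truncated

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // must be 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);        // append the 6 payload bits
        }
        if (!ok) break;
        codePoints.push_back(cp);
        i += extra + 1;
    }
    return codePoints;
}
```

Given the bytes 41 E4 B8 AD, the function yields the two code points U+0041 and U+4E2D.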
The two functions I wrote for converting between ANSI and UTF-8 are posted below. Each takes an MFC CString as its parameter; if you want to pass in a C-style character array instead, you need to modify them slightly.

// UTF-8 to ANSI
// Assumes a multibyte (MBCS) build, where CString holds char.
void UTF8toANSI(CString &strUTF8)
{
    // Get the buffer size required for the wide-character string and create the wide-character buffer
    int nLen = MultiByteToWideChar(CP_UTF8, 0, strUTF8, -1, NULL, 0);
    WCHAR *wszBuffer = new WCHAR[nLen + 1];
    nLen = MultiByteToWideChar(CP_UTF8, 0, strUTF8, -1, wszBuffer, nLen);
    wszBuffer[nLen] = 0;

    // Get the buffer size required for the ANSI string and create the multibyte buffer; 936 is the Simplified Chinese code page
    nLen = WideCharToMultiByte(936, 0, wszBuffer, -1, NULL, 0, NULL, NULL);
    CHAR *szBuffer = new CHAR[nLen + 1];
    nLen = WideCharToMultiByte(936, 0, wszBuffer, -1, szBuffer, nLen, NULL, NULL);
    szBuffer[nLen] = 0;

    strUTF8 = szBuffer;

    // Clean up memory
    delete[] szBuffer;
    delete[] wszBuffer;
}
// ANSI to UTF-8
// Assumes a multibyte (MBCS) build, where CString holds char.
void ANSItoUTF8(CString &strAnsi)
{
    // Get the buffer size required for the wide-character string and create the wide-character buffer; 936 is the Simplified Chinese GB2312 code page
    int nLen = MultiByteToWideChar(936, 0, strAnsi, -1, NULL, 0);
    WCHAR *wszBuffer = new WCHAR[nLen + 1];
    nLen = MultiByteToWideChar(936, 0, strAnsi, -1, wszBuffer, nLen);
    wszBuffer[nLen] = 0;

    // Get the buffer size required for the UTF-8 string and create the multibyte buffer
    nLen = WideCharToMultiByte(CP_UTF8, 0, wszBuffer, -1, NULL, 0, NULL, NULL);
    CHAR *szBuffer = new CHAR[nLen + 1];
    nLen = WideCharToMultiByte(CP_UTF8, 0, wszBuffer, -1, szBuffer, nLen, NULL, NULL);
    szBuffer[nLen] = 0;

    strAnsi = szBuffer;

    // Clean up memory
    delete[] wszBuffer;
    delete[] szBuffer;
}
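For callers that do not use MFC, the same two-step conversion can be written against std::string. The following is only a sketch under stated assumptions: the helper names ConvertCodePage, Utf8ToAnsi and AnsiToUtf8 are hypothetical, the input is treated as a null-terminated string (anything after an embedded null is dropped), and the code is Windows-only, hence the _WIN32 guard.

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: re-encode a string from one code page to another,
// going through UTF-16 as the intermediate form.
std::string ConvertCodePage(const std::string& input, UINT fromCodePage, UINT toCodePage)
{
    // Step 1: source code page -> UTF-16. With a length of -1 the returned
    // size already includes the terminating null.
    int wideLen = MultiByteToWideChar(fromCodePage, 0, input.c_str(), -1, nullptr, 0);
    if (wideLen <= 0) return std::string();
    std::vector<wchar_t> wide(wideLen);
    MultiByteToWideChar(fromCodePage, 0, input.c_str(), -1, wide.data(), wideLen);

    // Step 2: UTF-16 -> target code page.
    int outLen = WideCharToMultiByte(toCodePage, 0, wide.data(), -1, nullptr, 0, nullptr, nullptr);
    if (outLen <= 0) return std::string();
    std::vector<char> out(outLen);
    WideCharToMultiByte(toCodePage, 0, wide.data(), -1, out.data(), outLen, nullptr, nullptr);

    return std::string(out.data());  // copies up to the terminating null
}

// 936 is the Simplified Chinese code page used throughout this article.
std::string Utf8ToAnsi(const std::string& utf8) { return ConvertCodePage(utf8, CP_UTF8, 936); }
std::string AnsiToUtf8(const std::string& ansi) { return ConvertCodePage(ansi, 936, CP_UTF8); }
#endif
```

Using std::vector for the temporary buffers means there is no manual delete[] to forget, and returning a new string instead of modifying the argument keeps the original data available to the caller.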
Note that a UTF-8 encoded string is normally stored in a char array, not in a WCHAR (wchar_t) array. Why? UTF-8 is built from 1-byte units: each character takes 1 to 4 bytes, and some take only 1 byte, so a char array is the natural container. A WCHAR is 2 bytes on Windows, so it does not match a character that needs only one byte, and wide-string functions would interpret the stored data as UTF-16 and get it wrong.