ANSI strings are the ones we are most familiar with: an English letter occupies one byte, a Chinese character occupies two bytes, and the string ends with a '\0'. This encoding is commonly used in txt text files.
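Unicode strings: each character, whether a Chinese character or an English letter, occupies two bytes. On Windows this means UTF-16, so characters outside the Basic Multilingual Plane actually take two 16-bit units. In the VC++ world Microsoft prefers Unicode, using types such as wchar_t.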
UTF-8 is a variable-length encoding of Unicode, which can be thought of as a compressed form. The English letter A is stored as 0x0041 in Unicode, which wastes 50% of the space for English text, so UTF-8 stores English letters in a single byte. Chinese characters, however, occupy three bytes in UTF-8, which is clearly less economical than the two bytes they need in ANSI. This is why Chinese web pages commonly use ANSI encoding while foreign web pages use UTF-8. In one test, converting a 15.7 MB txt file from UTF-8 to ANSI reduced its size to 10.8 MB.
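The byte counts above can be checked with a small encoder. This is a minimal sketch in portable C++; encodeUtf8 is a hypothetical helper written for illustration, not part of the Windows API, and it simply follows the standard UTF-8 bit layout.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
std::string encodeUtf8(std::uint32_t cp)
{
    // Valid code points end at U+10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // ASCII: 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes (most Chinese characters)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encodeUtf8(0x41) (the letter A) yields 1 byte, while encodeUtf8(0x4E2D) (the character 中) yields the 3 bytes E4 B8 AD.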
Generally, two functions declared in the Windows header file are enough to convert between these types. Include the header file:
#include <Windows.h>
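Multi-byte character set -> Unicode character set
int MultiByteToWideChar( __in UINT CodePage, __in DWORD dwFlags, __in LPCSTR lpMultiByteStr, __in int cbMultiByte, __out LPWSTR lpWideCharStr, __in int cchWideChar );
Unicode character set -> multi-byte character set
int WideCharToMultiByte( __in UINT CodePage, __in DWORD dwFlags, __in LPCWSTR lpWideCharStr, __in int cchWideChar, __out LPSTR lpMultiByteStr, __in int cbMultiByte, __in LPCSTR lpDefaultChar, __out LPBOOL lpUsedDefaultChar );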
WideCharToMultiByte uses its last two parameters only when a character has no representation in the code page given by CodePage. When a character cannot be converted, the function substitutes the character pointed to by lpDefaultChar. If this parameter is NULL, the function uses the system default character, which is usually a question mark. This is very dangerous for file operations, because the question mark is a wildcard.
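One way to avoid silently producing question marks is to inspect lpUsedDefaultChar after the conversion. The following is a minimal sketch under stated assumptions: toCodePageStrict is a hypothetical helper name, the WC_NO_BEST_FIT_CHARS flag is used so that approximate substitutions also count as failures, and the _WIN32 guard only keeps the snippet compiling on other platforms.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#include <Windows.h>

// Hypothetical helper: convert UTF-16 text to a non-UTF-8 code page.
// Returns false if the call fails or any character had to be replaced
// by the default character.
bool toCodePageStrict(const std::wstring& wide, UINT codePage, std::string& out)
{
    // lpUsedDefaultChar must be NULL for CP_UTF8 and CP_UTF7.
    assert(codePage != CP_UTF8 && codePage != CP_UTF7);
    out.clear();
    if (wide.empty()) return true;

    const int srcLen = static_cast<int>(wide.size());
    // First call: ask for the required buffer size in bytes.
    int len = WideCharToMultiByte(codePage, WC_NO_BEST_FIT_CHARS,
                                  wide.data(), srcLen, NULL, 0, NULL, NULL);
    if (len <= 0) return false;

    out.resize(len);
    BOOL usedDefault = FALSE;
    // Second call: convert and record whether a substitution happened.
    len = WideCharToMultiByte(codePage, WC_NO_BEST_FIT_CHARS,
                              wide.data(), srcLen, &out[0], len,
                              NULL, &usedDefault);
    if (len <= 0 || usedDefault) {
        out.clear();
        return false;
    }
    return true;
}
#endif
```

For example, converting L"abc" to code page 1252 should succeed, while a Chinese character has no representation there and should make the helper return false instead of yielding a '?'.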
Program header files:
#include <iostream>
#include <string>
#include <fstream>
#include <Windows.h>
using namespace std;
ANSI to Unicode
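const char* sAnsi = "Hello";  // any ANSI string
// First call returns the required length in wide characters, including the terminator.
int sLen = MultiByteToWideChar(CP_ACP, 0, sAnsi, -1, NULL, 0);
wchar_t* sUnicode = new wchar_t[sLen];
MultiByteToWideChar(CP_ACP, 0, sAnsi, -1, sUnicode, sLen);
ofstream rtxt("ansi_to_unicode.txt", ios::binary);
// Write the characters only, skipping the terminating L'\0'.
rtxt.write((char*)sUnicode, (sLen - 1) * sizeof(wchar_t));
rtxt.close();
delete[] sUnicode;
sUnicode = NULL;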
Unicode to ANSI
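const wchar_t* sUnicode = L"Hello";  // any Unicode string
// First call returns the required length in bytes, including the terminator.
int sLen = WideCharToMultiByte(CP_ACP, 0, sUnicode, -1, NULL, 0, NULL, NULL);
char* sAnsi = new char[sLen];
WideCharToMultiByte(CP_ACP, 0, sUnicode, -1, sAnsi, sLen, NULL, NULL);
// sAnsi now holds the ANSI text.
delete[] sAnsi;
sAnsi = NULL;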
Unicode to UTF8
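const wchar_t* sUnicode = L"Hello";  // any Unicode string
// First call returns the required length in bytes, including the terminator.
int sLen = WideCharToMultiByte(CP_UTF8, 0, sUnicode, -1, NULL, 0, NULL, NULL);
char* sUtf8 = new char[sLen];
// For CP_UTF8 the last two parameters must be NULL.
WideCharToMultiByte(CP_UTF8, 0, sUnicode, -1, sUtf8, sLen, NULL, NULL);
// sUtf8 now holds the UTF-8 text.
delete[] sUtf8;
sUtf8 = NULL;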
UTF8 to Unicode
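const char* sUtf8 = "Hello";  // any UTF-8 string
// First call returns the required length in wide characters, including the terminator.
int sLen = MultiByteToWideChar(CP_UTF8, 0, sUtf8, -1, NULL, 0);
wchar_t* sUnicode = new wchar_t[sLen];
MultiByteToWideChar(CP_UTF8, 0, sUtf8, -1, sUnicode, sLen);
ofstream rtxt("utf8_to_unicode.txt", ios::binary);
// Write the characters only, skipping the terminating L'\0'.
rtxt.write((char*)sUnicode, (sLen - 1) * sizeof(wchar_t));
rtxt.close();
delete[] sUnicode;
sUnicode = NULL;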
ANSI-to-UTF-8 and UTF-8-to-ANSI conversions are combinations of the two steps above: Unicode serves as the intermediate form, and the text is converted twice.
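The two-step route can be sketched with std::string and std::wstring, which avoids manual new and delete. The helper names toWide, fromWide, ansiToUtf8 and utf8ToAnsi are hypothetical, chosen for illustration; only MultiByteToWideChar and WideCharToMultiByte come from the Windows API. The _WIN32 guard only keeps the snippet compiling on other platforms.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#include <Windows.h>

// Hypothetical helper: multi-byte text in code page cp -> UTF-16.
std::wstring toWide(const std::string& s, UINT cp)
{
    if (s.empty()) return std::wstring();
    const int srcLen = static_cast<int>(s.size());
    int n = MultiByteToWideChar(cp, 0, s.data(), srcLen, NULL, 0);
    if (n <= 0) return std::wstring();
    std::wstring w(n, L'\0');
    int written = MultiByteToWideChar(cp, 0, s.data(), srcLen, &w[0], n);
    assert(written == n);  // both calls should agree on the length
    return w;
}

// Hypothetical helper: UTF-16 -> multi-byte text in code page cp.
std::string fromWide(const std::wstring& w, UINT cp)
{
    if (w.empty()) return std::string();
    const int srcLen = static_cast<int>(w.size());
    int n = WideCharToMultiByte(cp, 0, w.data(), srcLen, NULL, 0, NULL, NULL);
    if (n <= 0) return std::string();
    std::string s(n, '\0');
    int written = WideCharToMultiByte(cp, 0, w.data(), srcLen, &s[0], n, NULL, NULL);
    assert(written == n);
    return s;
}

// ANSI -> Unicode -> UTF-8
std::string ansiToUtf8(const std::string& ansi)
{
    return fromWide(toWide(ansi, CP_ACP), CP_UTF8);
}

// UTF-8 -> Unicode -> ANSI
std::string utf8ToAnsi(const std::string& utf8)
{
    return fromWide(toWide(utf8, CP_UTF8), CP_ACP);
}
#endif
```

Because an explicit length is passed instead of -1, the returned strings do not contain an extra terminating '\0'. Characters that the ANSI code page cannot represent are still replaced by the default character in utf8ToAnsi.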
During network transmission we often use UTF-8 encoding, but inside a program we are used to working with ANSI encoding; at the very least, UTF-8 text is displayed as garbled characters in VS2010. The following functions combine the procedures above to convert the contents of a UTF-8 encoded txt file to ANSI encoding.
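// Convert a UTF-8 string to ANSI. The caller must delete[] the result.
char* changeTxtEncoding(char* szU8)
{
    int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szU8, -1, NULL, 0);
    wchar_t* wszString = new wchar_t[wcsLen];
    ::MultiByteToWideChar(CP_UTF8, 0, szU8, -1, wszString, wcsLen);
    wcout << wszString << endl;
    int ansiLen = ::WideCharToMultiByte(CP_ACP, 0, wszString, -1, NULL, 0, NULL, NULL);
    char* szAnsi = new char[ansiLen];
    ::WideCharToMultiByte(CP_ACP, 0, wszString, -1, szAnsi, ansiLen, NULL, NULL);
    delete[] wszString;
    return szAnsi;
}

// Read a UTF-8 txt file and rewrite it in ANSI encoding.
void changeTextFromUtf8ToAnsi(const char* filename)
{
    ifstream infile(filename);
    string strLine = "";
    string strResult = "";
    while (getline(infile, strLine))
    {
        strResult += strLine + "\n";
    }
    infile.close();
    char* changeTemp = new char[strResult.length() + 1];
    strcpy(changeTemp, strResult.c_str());
    char* changeResult = changeTxtEncoding(changeTemp);
    ofstream outfile(filename);
    outfile.write(changeResult, strlen(changeResult));
    outfile.close();
    delete[] changeTemp;
    delete[] changeResult;
}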
Notes:
A. The length() and size() functions of the string type return the actual length of the string, excluding the '\0'.
B. The strlen() function for char* strings also returns the actual length of the string, excluding the '\0'.
C. Note that sizeof counts the '\0'. For example, given char str[] = "Hello"; sizeof(str) is 6.
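The three notes can be checked side by side. This is a small portable sketch; the Sizes struct and helloSizes function are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical holder for the three measurements of "Hello".
struct Sizes {
    std::size_t lengthValue;  // std::string::length()
    std::size_t strlenValue;  // strlen() on the char array
    std::size_t sizeofValue;  // sizeof on the char array
};

Sizes helloSizes()
{
    char str[] = "Hello";    // 5 characters plus the terminating '\0'
    std::string s = str;
    assert(s.length() == s.size());  // the two are synonyms
    Sizes r = { s.length(), std::strlen(str), sizeof(str) };
    return r;
}
```

Here lengthValue and strlenValue are both 5, while sizeofValue is 6. Note that this only holds for a char array: applying sizeof to a char* yields the size of the pointer, not of the string.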