1. Review of the three encodings
ANSI strings are the ones we are most familiar with. An English letter occupies one byte and a Chinese character occupies two bytes, and the string ends with a '\0'. On Windows, "ANSI" means the system's active code page. This encoding is common in TXT text files.
Unicode strings. Each character (Chinese character or English letter) occupies 2 bytes; on Windows this is UTF-16, so characters outside the Basic Multilingual Plane take 4 bytes as a surrogate pair. In the VC++ world, Microsoft prefers Unicode, stored in the wchar_t type.
UTF-8 is a variable-length encoding of Unicode that can be thought of as a compressed form. The English letter A is 0x0041 in Unicode. For English text, two bytes per letter wastes 50% of the space, so UTF-8 stores English letters in a single byte. Chinese characters, however, occupy three bytes in UTF-8, which is clearly less economical than the two bytes they need in ANSI. This is why ANSI encoding is commonly used for Chinese web pages and UTF-8 for foreign web pages. For example, after converting a 15.7 MB UTF-8 TXT file to ANSI, the file size is only 10.8 MB.
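The byte counts above can be checked with a small portable sketch. The helper name encodeUtf8 is hypothetical and is not part of the article's program; it simply applies the standard UTF-8 bit layout to one code point, and it needs no Windows headers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the article): encode one Unicode code point as UTF-8.
std::string encodeUtf8(std::uint32_t cp)
{
    // Only valid scalar values are accepted: no surrogates, nothing above U+10FFFF.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                 // ASCII range: 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: most Chinese characters fall here
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encodeUtf8(0x41) yields the single byte 0x41 for the letter A, while encodeUtf8(0x8F6C) yields the three bytes E8 BD AC for one Chinese character.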
2. Conversion functions
Generally, two functions declared in the Windows header file are enough to convert between all of these encodings. Include the header file:
#include <windows.h>
Multi-byte character set -> Unicode character set
int MultiByteToWideChar(
    // Code page associated with the multi-byte string
    _In_ UINT CodePage,
    // Additional control that affects characters with diacritical marks (such as accents).
    // It is rarely needed, so pass 0.
    _In_ DWORD dwFlags,
    // The string to be converted
    _In_ LPCSTR lpMultiByteStr,
    // Length (in bytes) of the string to be converted. If -1 is passed,
    // the function determines the length of the null-terminated source string itself.
    _In_ int cbMultiByte,
    // Address of the buffer that receives the converted Unicode string
    _Out_ LPWSTR lpWideCharStr,
    // Maximum size of the lpWideCharStr buffer, in wide characters.
    // If 0 is passed, the function performs no conversion and returns the required number
    // of wide characters (including the terminating '\0' when cbMultiByte is -1).
    // The conversion succeeds only if the buffer can hold that many wide characters.
    _In_ int cchWideChar
);
Unicode character set -> multi-byte character set
int WideCharToMultiByte(
    // Code page to use for the conversion (the code page of the resulting multi-byte string)
    _In_ UINT CodePage,
    // Additional conversion control. This level of control is generally not needed, so pass 0.
    _In_ DWORD dwFlags,
    // Address of the wide-character string to be converted
    _In_ LPCWSTR lpWideCharStr,
    // Length of the string in wide characters. If -1 is passed,
    // the function determines the length of the string itself.
    _In_ int cchWideChar,
    // Buffer that receives the converted string
    _Out_ LPSTR lpMultiByteStr,
    // Maximum size (in bytes) of the lpMultiByteStr buffer.
    // If 0 is passed, the function returns the size required for the target buffer.
    _In_ int cbMultiByte,
    // Character to substitute for any character that cannot be represented; usually NULL
    _In_ LPCSTR lpDefaultChar,
    // Set to TRUE if at least one character in the wide-character string could not be
    // converted to multi-byte form, and to FALSE if every character converted successfully.
    // This parameter is usually passed as NULL.
    _Out_ LPBOOL lpUsedDefaultChar
);
WideCharToMultiByte uses the last two parameters only when a character has no representation in the code page given by CodePage. When a character cannot be converted, the function substitutes the character pointed to by lpDefaultChar. If that parameter is NULL, the function uses the system default character, which is usually a question mark. This is dangerous for file operations, because the question mark is a wildcard. Note that when CodePage is CP_UTF8, both of these parameters must be NULL.
3. Program implementation
Program header file:
/*
 * Author: Hou Kai
 * Description: ANSI, Unicode, and UTF-8 conversion
 * Date:
 */
#include <iostream>
#include <string>
#include <cstring>   // strcpy
#include <fstream>
#include <windows.h> // Windows header file
using std::string;
using namespace std;
ANSI to Unicode
void AnsiToUnicode()
{
    const char* sAnsi = "ANSI to Unicode, ANSI to Unicode";
    // ANSI to Unicode: the first call returns the required length in wide characters, including '\0'
    int sLen = MultiByteToWideChar(CP_ACP, 0, sAnsi, -1, NULL, 0);
    wchar_t* sUnicode = new wchar_t[sLen];
    // wchar_t* sUnicode = (wchar_t*)malloc(sLen * sizeof(wchar_t));
    MultiByteToWideChar(CP_ACP, 0, sAnsi, -1, sUnicode, sLen);
    // Binary mode keeps the stream from rewriting 0x0A bytes inside the UTF-16 data
    ofstream rtxt("ansitouni.txt", ios::binary);
    // Little-endian byte order mark; for the reason, see the previous article on byte order
    rtxt.write("\xFF\xFE", 2);
    // sLen - 1: do not write the terminating '\0' to the file
    rtxt.write((char*)sUnicode, (sLen - 1) * sizeof(wchar_t));
    rtxt.close();
    delete[] sUnicode;
    sUnicode = NULL;
    // free(sUnicode);
}
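The function above writes the wchar_t buffer exactly as it lies in memory, which matches the FF FE mark because Windows machines are little-endian. As a minimal portable sketch of that file layout, assuming the text is already available as UTF-16 code units, the bytes can also be produced explicitly. The name toUtf16LeWithBom is hypothetical and is not part of the article's program.

```cpp
#include <string>

// Hypothetical helper (not from the article): serialize UTF-16 code units as
// little-endian bytes, preceded by the FF FE byte order mark.
std::string toUtf16LeWithBom(const std::u16string& text)
{
    std::string bytes = "\xFF\xFE";                      // BOM for UTF-16 little-endian
    for (char16_t unit : text) {
        bytes += static_cast<char>(unit & 0xFF);         // low byte first
        bytes += static_cast<char>((unit >> 8) & 0xFF);  // then the high byte
    }
    return bytes;
}
```

For example, toUtf16LeWithBom(u"A") produces the four bytes FF FE 41 00, which is what a text editor expects at the start of a little-endian Unicode file.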
Unicode to ANSI
void UnicodeToAnsi()
{
    const wchar_t* sUnicode = L"Convert Unicode to ANSI, Unicode to ANSI";
    // Unicode to ANSI: the first call returns the required size in bytes, including '\0'
    int sLen = WideCharToMultiByte(CP_ACP, 0, sUnicode, -1, NULL, 0, NULL, NULL);
    char* sAnsi = new char[sLen];
    // char* sAnsi = (char*)malloc(sLen);
    WideCharToMultiByte(CP_ACP, 0, sUnicode, -1, sAnsi, sLen, NULL, NULL);
    ofstream rtxt("unitoansi.txt");
    // sLen - 1: do not write the terminating '\0' to the file
    rtxt.write(sAnsi, sLen - 1);
    rtxt.close();
    delete[] sAnsi;
    sAnsi = NULL;
    // free(sAnsi);
}
Unicode to UTF-8
void UnicodeToUtf8()
{
    const wchar_t* sUnicode = L"Convert Unicode to utf8, Unicode to utf8";
    // Unicode to UTF-8
    int sLen = WideCharToMultiByte(CP_UTF8, 0, sUnicode, -1, NULL, 0, NULL, NULL);
    // UTF-8 is a compact Unicode format, but it is still a multi-byte string, so it can be stored as char
    char* sUtf8 = new char[sLen];
    // The counterpart of strlen for Unicode strings is wcslen
    WideCharToMultiByte(CP_UTF8, 0, sUnicode, -1, sUtf8, sLen, NULL, NULL);
    ofstream rtxt("unitoutf8.txt");
    // UTF-8 byte order mark; for the reason, see the previous article
    rtxt.write("\xEF\xBB\xBF", 3);
    // sLen - 1: do not write the terminating '\0' to the file
    rtxt.write(sUtf8, sLen - 1);
    rtxt.close();
    delete[] sUtf8;
    sUtf8 = NULL;
}
UTF-8 to Unicode
void Utf8ToUnicode()
{
    // UTF-8 to Unicode. The bytes \xe8\xbd\xac\xe6\x8d\xa2\xe4\xb8\xba are the Chinese for
    // "convert to". The raw characters show up garbled, so they were copied from the UE hex
    // view and written as hexadecimal escapes.
    const char* sUtf8 = "Utf8 convert to Unicode, utf8 \xe8\xbd\xac\xe6\x8d\xa2\xe4\xb8\xba Unicode";
    int sLen = MultiByteToWideChar(CP_UTF8, 0, sUtf8, -1, NULL, 0);
    wchar_t* sUnicode = new wchar_t[sLen];
    MultiByteToWideChar(CP_UTF8, 0, sUtf8, -1, sUnicode, sLen);
    // Binary mode keeps the stream from rewriting 0x0A bytes inside the UTF-16 data
    ofstream rtxt("utf8touni.txt", ios::binary);
    rtxt.write("\xFF\xFE", 2); // little-endian byte order mark
    // sLen - 1: do not write the terminating '\0' to the file
    rtxt.write((char*)sUnicode, (sLen - 1) * sizeof(wchar_t));
    rtxt.close();
    delete[] sUnicode;
    sUnicode = NULL;
}
Converting ANSI to UTF-8 and UTF-8 to ANSI combines the two steps above: Unicode serves as the intermediate form, and the string is converted twice.
4. UTF-8 to ANSI
For network transmission we often use UTF-8, but inside a program we are used to working with ANSI. At the very least, UTF-8 text is displayed as garbled characters in VS2010. The following functions combine the procedures above to convert a UTF-8 encoded TXT file to ANSI.
// ChangeTxtEncoding: convert a UTF-8 string to ANSI. The caller must delete[] the result.
char* ChangeTxtEncoding(char* szU8)
{
    int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szU8, -1, NULL, 0);
    wchar_t* wszString = new wchar_t[wcsLen];
    ::MultiByteToWideChar(CP_UTF8, 0, szU8, -1, wszString, wcsLen);
    // Debug output. wcout is needed here: cout would print only the pointer value.
    wcout << wszString << endl;
    int ansiLen = ::WideCharToMultiByte(CP_ACP, 0, wszString, -1, NULL, 0, NULL, NULL); // wcslen(wszString)
    char* szAnsi = new char[ansiLen];
    ::WideCharToMultiByte(CP_ACP, 0, wszString, -1, szAnsi, ansiLen, NULL, NULL);
    delete[] wszString;
    return szAnsi;
}

void ChangeTextFromUtf8ToAnsi(const char* filename)
{
    ifstream infile;
    string strLine = "";
    string strResult = "";
    infile.open(filename);
    infile.seekg(3, ios::beg); // skip the 3-byte UTF-8 BOM (assumes the file starts with one)
    if (infile)
    {
        // Testing getline directly avoids appending an extra empty line at end of file
        while (getline(infile, strLine))
        {
            strResult += strLine + "\n";
        }
    }
    infile.close();
    char* changeTemp = new char[strResult.length() + 1];
    changeTemp[strResult.length()] = '\0'; // see the problem record: length() excludes '\0'
    strcpy(changeTemp, strResult.c_str()); // copy the const char* into a writable char*
    char* changeResult = ChangeTxtEncoding(changeTemp);
    strResult = changeResult;
    ofstream outfile;
    outfile.open("ansi.txt");
    outfile.write(strResult.c_str(), strResult.length());
    outfile.flush();
    outfile.close();
    delete[] changeResult;
    delete[] changeTemp;
}
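The seekg(3, ...) call above assumes that the input file always begins with the UTF-8 byte order mark. A file saved without one would lose its first three bytes. One possible safeguard, shown here as a sketch rather than as the author's method, is to look at the first three bytes before skipping them. The helper name skipUtf8Bom is hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the article): consume a leading UTF-8 BOM
// (EF BB BF) if one is present; otherwise rewind to the start of the stream.
// Returns true when a BOM was found and skipped.
bool skipUtf8Bom(std::istream& in)
{
    char head[3] = {0, 0, 0};
    in.read(head, 3);
    bool hasBom = in.gcount() == 3 &&
                  head[0] == '\xEF' && head[1] == '\xBB' && head[2] == '\xBF';
    if (!hasBom) {
        in.clear();                  // a file shorter than 3 bytes sets failbit/eofbit
        in.seekg(0, std::ios::beg);  // no BOM: the first bytes are real content
    }
    return hasBom;
}
```

The function takes any std::istream, so it could be called on the opened ifstream in place of the unconditional seekg, or tried out on a std::istringstream.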
Problem record:
About the length of a string:
A. The length() and size() functions of the string type return the actual length of the string, excluding '\0';
B. The strlen() function for char* also returns the actual length of the string, excluding '\0';
C. Note that the sizeof operator counts the '\0'. For example, given char str[] = "hello"; sizeof(str) is 6.
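The three rules can be checked with a short sketch. The names kHello and bufferSizeFor are hypothetical and exist only for this illustration; bufferSizeFor mirrors the length() + 1 allocation used for changeTemp above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Array of 6 chars: 'h', 'e', 'l', 'l', 'o', '\0'
const char kHello[] = "hello";

// Rule C: sizeof on the array counts the terminating '\0'.
static_assert(sizeof(kHello) == 6, "sizeof counts the terminating '\\0'");

// Hypothetical helper: bytes needed for a writable copy of s, including the '\0'.
// Rule A: length() excludes the '\0', so one extra byte has to be added.
std::size_t bufferSizeFor(const std::string& s)
{
    return s.length() + 1;
}
```

Here std::strlen(kHello) and std::string(kHello).length() both give 5 (rules A and B), while sizeof(kHello) and bufferSizeFor(kHello) both give 6.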