ANSI, UTF-8, and Unicode encoding (continued)

Source: Internet
Author: User

1. Review of the three encodings

ANSI strings are the ones we are most familiar with: an English letter occupies one byte, a Chinese character occupies 2 bytes, and the string ends with a \0. ANSI is commonly used in TXT text files.
Unicode strings: each character (Chinese character or English letter) occupies 2 bytes. More precisely, this is true of the UTF-16 form that Windows uses for characters in the Basic Multilingual Plane; characters outside it take 4 bytes. In the VC++ world, Microsoft prefers Unicode, for example through the wchar_t type.
UTF-8 is a variable-length encoding of Unicode that acts like a compressed form for English text. The English letter A is 0x0041 in Unicode. For English text this two-byte storage wastes 50% of the space, so UTF-8 encodes English letters in a single byte. Chinese characters, however, occupy three bytes in UTF-8, which is less economical than the two bytes they take in ANSI. This is why Chinese web pages commonly use ANSI encoding while foreign web pages commonly use UTF-8. For example, after converting a 15.7 MB UTF-8 TXT file to ANSI with a program, the file size was only 10.8 MB.
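The byte counts described above can be checked with a small portable sketch. The helper name encodeUtf8 is hypothetical (it does not come from the original article), and the code assumes its argument is a valid Unicode code point that is not a surrogate.

```cpp
#include <string>

// Hypothetical helper (not from the article): encodes one Unicode code point
// as UTF-8. Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII, e.g. 'A'
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: most Chinese characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: outside the Basic Multilingual Plane
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encodeUtf8(0x41) yields the single byte 41, while encodeUtf8(0x8F6C) (a common Chinese character) yields the three bytes E8 BD AC.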

2. Conversion functions

Generally, two functions declared in the Windows header file cover all of these conversions. Include the header file:

    #include <windows.h>

Multi-byte character set -> Unicode character set

    int MultiByteToWideChar(
        _In_  UINT   CodePage,        // Code page associated with the multi-byte string
        _In_  DWORD  dwFlags,         // Additional control that affects characters with diacritics (such as accents); usually not needed, pass 0
        _In_  LPCSTR lpMultiByteStr,  // String to be converted
        _In_  int    cbMultiByte,     // Length of the string to be converted, in bytes; if -1, the function determines the length of the source string itself
        _Out_ LPWSTR lpWideCharStr,   // Buffer that receives the converted Unicode string
        _In_  int    cchWideChar      // Maximum length of the lpWideCharStr buffer, in wide characters.
                                      // If 0 is passed, the function does not convert; it returns the required number of wide characters
                                      // (including the terminating '\0' when cbMultiByte is -1).
                                      // The conversion succeeds only when the buffer can hold that many wide characters.
    );

Unicode character set -> multi-byte character set

    int WideCharToMultiByte(
        _In_  UINT    CodePage,           // Code page used for the conversion (that of the resulting multi-byte string)
        _In_  DWORD   dwFlags,            // Additional conversion control; usually not needed, pass 0
        _In_  LPCWSTR lpWideCharStr,      // String to be converted
        _In_  int     cchWideChar,        // Length of the string in wide characters; if -1, the function determines the length itself
        _Out_ LPSTR   lpMultiByteStr,     // Buffer that receives the converted string
        _In_  int     cbMultiByte,        // Maximum size of the lpMultiByteStr buffer in bytes; if 0, the function returns the size the target buffer needs
        _In_  LPCSTR  lpDefaultChar,      // Character substituted for wide characters that cannot be converted; usually NULL
        _Out_ LPBOOL  lpUsedDefaultChar   // Set to TRUE if at least one wide character could not be converted to multi-byte form,
                                          // FALSE if all characters were converted; usually NULL
    );

WideCharToMultiByte uses the last two parameters only when a character has no representation in the code page given by CodePage. When a character cannot be converted, the function substitutes the character pointed to by lpDefaultChar. If this parameter is NULL, the function uses the system default character, which is usually a question mark. This is very dangerous for file operations, because the question mark is a wildcard in file names.
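The two-call pattern (query the size with a 0-length buffer, then convert) and the lpUsedDefaultChar flag can be combined into one wrapper. This is a minimal sketch, not the author's code: the name WideToAnsiChecked and the lossy output parameter are hypothetical, and the #ifdef guard reflects the assumption that the code is only built on Windows.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper (not from the article): converts UTF-16 text to the
// system ANSI code page and reports whether any character was replaced by
// the default character.
std::string WideToAnsiChecked(const std::wstring& wide, bool& lossy) {
    lossy = false;
    if (wide.empty()) return std::string();   // the API rejects a source length of 0
    const int wideLen = static_cast<int>(wide.size());

    // First call: cbMultiByte == 0 only asks for the required byte count.
    int bytes = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), wideLen,
                                    NULL, 0, NULL, NULL);
    if (bytes <= 0) return std::string();

    std::string ansi(bytes, '\0');
    BOOL usedDefault = FALSE;
    // Second call: performs the conversion and fills in the flag.
    WideCharToMultiByte(CP_ACP, 0, wide.c_str(), wideLen,
                        &ansi[0], bytes, NULL, &usedDefault);
    lossy = (usedDefault != FALSE);
    return ansi;
}
#endif
```

A caller that is about to use the result as a file name could refuse to continue when lossy is true instead of silently working with a name that contains question marks.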

3. Program Implementation

Program header file:

    /**
     * Author: Hou Kai
     * Description: ANSI, Unicode, and UTF-8 conversion
     * Date:
     */
    #include <iostream>
    #include <string>
    #include <cstring>    // strcpy
    #include <fstream>
    #include <windows.h>  // Windows header file
    using std::string;
    using namespace std;

ANSI to Unicode

    void AnsiToUnicode() {
        const char* sAnsi = "ANSI to Unicode, ANSI to Unicode";
        // ANSI to Unicode: the first call returns the required length in wide characters, including the terminating L'\0'
        int sLen = MultiByteToWideChar(CP_ACP, 0, sAnsi, -1, NULL, 0);
        wchar_t* sUnicode = new wchar_t[sLen];
        // wchar_t* sUnicode = (wchar_t*)malloc(sLen * sizeof(wchar_t));
        MultiByteToWideChar(CP_ACP, 0, sAnsi, -1, sUnicode, sLen);
        ofstream rtxt("Ansitouni.txt", ios::binary);  // binary mode so that no byte of the UTF-16 data is altered
        rtxt.write("\xFF\xFE", 2);  // byte order mark for little-endian storage; for the reason, see the previous article
        rtxt.write((char*)sUnicode, (sLen - 1) * sizeof(wchar_t));  // the terminating L'\0' is not written to the file
        rtxt.close();
        delete[] sUnicode;
        sUnicode = NULL;
        // free(sUnicode);
    }
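The function above writes the two bytes FF FE before the wide-character buffer. On Windows that buffer already holds 16-bit little-endian units, so the mark matches the data. The portable sketch below makes the byte order explicit. It uses char16_t rather than wchar_t as an assumption, because wchar_t is not 16 bits wide on every platform, and the name toUtf16LeWithBom is hypothetical.

```cpp
#include <string>

// Hypothetical helper (not from the article): serializes UTF-16 code units as
// little-endian bytes, preceded by the byte order mark FF FE.
std::string toUtf16LeWithBom(const std::u16string& text) {
    std::string bytes = "\xFF\xFE";                       // U+FEFF stored low byte first
    for (char16_t unit : text) {
        bytes += static_cast<char>(unit & 0xFF);          // low byte
        bytes += static_cast<char>((unit >> 8) & 0xFF);   // high byte
    }
    return bytes;
}
```

For the single letter A (0x0041) the result is the four bytes FF FE 41 00, which is what a hex editor shows at the start of such a file.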

Unicode to ANSI

    void UnicodeToAnsi() {
        const wchar_t* sUnicode = L"Convert Unicode to ANSI, Unicode to ANSI";
        // Unicode to ANSI: the first call returns the required size in bytes, including the terminating '\0'
        int sLen = WideCharToMultiByte(CP_ACP, 0, sUnicode, -1, NULL, 0, NULL, NULL);
        char* sAnsi = new char[sLen];
        // char* sAnsi = (char*)malloc(sLen);
        WideCharToMultiByte(CP_ACP, 0, sUnicode, -1, sAnsi, sLen, NULL, NULL);
        ofstream rtxt("Unitoansi.txt");
        rtxt.write(sAnsi, sLen - 1);  // the terminating '\0' is not written to the file
        rtxt.close();
        delete[] sAnsi;
        sAnsi = NULL;
        // free(sAnsi);
    }

Unicode to UTF-8

    void UnicodeToUtf8() {
        const wchar_t* sUnicode = L"Convert Unicode to utf8, Unicode to utf8";
        // Unicode to UTF-8
        int sLen = WideCharToMultiByte(CP_UTF8, 0, sUnicode, -1, NULL, 0, NULL, NULL);
        // UTF-8 is an encoding of Unicode, but it is also a multi-byte string, so it can be stored as char
        char* sUtf8 = new char[sLen];
        // The strlen for Unicode is wcslen.
        WideCharToMultiByte(CP_UTF8, 0, sUnicode, -1, sUtf8, sLen, NULL, NULL);
        ofstream rtxt("Unitoutf8.txt");
        rtxt.write("\xEF\xBB\xBF", 3);  // UTF-8 byte order mark; for the reason, see the previous article
        rtxt.write(sUtf8, sLen - 1);    // the terminating '\0' is not written to the file
        rtxt.close();
        delete[] sUtf8;
        sUtf8 = NULL;
    }

UTF-8 to Unicode

    void Utf8ToUnicode() {
        // UTF-8 to Unicode. The Chinese text for "convert to" was opened in UE in hexadecimal mode and its bytes
        // are written here as hexadecimal escapes, because copying it directly would produce garbled characters.
        const char* sUtf8 = "Utf8 convert to Unicode, utf8 \xe8\xbd\xac\xe6\x8d\xa2\xe4\xb8\xba Unicode";
        // UTF-8 to Unicode
        int sLen = MultiByteToWideChar(CP_UTF8, 0, sUtf8, -1, NULL, 0);
        wchar_t* sUnicode = new wchar_t[sLen];
        MultiByteToWideChar(CP_UTF8, 0, sUtf8, -1, sUnicode, sLen);
        ofstream rtxt("Utf8touni.txt", ios::binary);  // binary mode so that no byte of the UTF-16 data is altered
        rtxt.write("\xFF\xFE", 2);  // byte order mark for little-endian storage
        rtxt.write((char*)sUnicode, (sLen - 1) * sizeof(wchar_t));  // the terminating L'\0' is not written to the file
        rtxt.close();
        delete[] sUnicode;
        sUnicode = NULL;
    }
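To see what the hexadecimal bytes in the sample string stand for, the decoding step can be written out by hand. This is a portable sketch of the UTF-8 decoding rules, not a description of how MultiByteToWideChar is implemented. The name decodeUtf8 is hypothetical, and the code assumes well-formed input and performs no validation.

```cpp
#include <string>

// Hypothetical helper (not from the article): decodes UTF-8 into UTF-16 code
// units. Assumes well-formed input; it performs no validation.
std::u16string decodeUtf8(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        int extra;                                   // number of continuation bytes
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (int k = 0; k < extra && i < utf8.size(); ++k, ++i) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {                                     // surrogate pair above U+FFFF
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Applied to the nine bytes E8 BD AC E6 8D A2 E4 B8 BA from the sample string, the sketch returns the three code units 0x8F6C, 0x6362, and 0x4E3A, one per Chinese character.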

ANSI to UTF-8 and UTF-8 to ANSI conversions are combinations of the functions above: use Unicode as the intermediate form and convert twice.
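The ANSI to UTF-8 direction can be sketched as two chained calls. This is a minimal sketch under stated assumptions: the name AnsiToUtf8 is hypothetical, std::string and std::wstring are used in place of raw buffers, and the #ifdef guard reflects that the code is only built on Windows.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper (not from the article): ANSI -> UTF-16 -> UTF-8.
std::string AnsiToUtf8(const std::string& ansi) {
    if (ansi.empty()) return std::string();          // the API rejects a source length of 0
    const int ansiLen = static_cast<int>(ansi.size());

    // Step 1: ANSI (system code page) to UTF-16.
    int wideLen = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), ansiLen, NULL, 0);
    if (wideLen <= 0) return std::string();
    std::wstring wide(wideLen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), ansiLen, &wide[0], wideLen);

    // Step 2: UTF-16 to UTF-8. The last two arguments must be NULL for CP_UTF8.
    int utf8Len = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), wideLen,
                                      NULL, 0, NULL, NULL);
    if (utf8Len <= 0) return std::string();
    std::string utf8(utf8Len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), wideLen,
                        &utf8[0], utf8Len, NULL, NULL);
    return utf8;
}
#endif
```

Because explicit lengths are passed instead of -1, the returned sizes do not include a terminating '\0', and std::string manages the memory, so no delete[] is needed.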

4. UTF-8 to ANSI

During network transmission we often use UTF-8 encoding, but during program processing we are used to ANSI encoding; at the very least, UTF-8 text is displayed as garbled characters in VS2010. The following functions combine the procedures above to convert a UTF-8 encoded TXT file to ANSI encoding.

    // ChangeTxtEncoding: converts a UTF-8 string to ANSI; the caller must delete[] the result
    char* ChangeTxtEncoding(char* szU8) {
        int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szU8, -1, NULL, 0);
        wchar_t* wszString = new wchar_t[wcsLen];
        ::MultiByteToWideChar(CP_UTF8, 0, szU8, -1, wszString, wcsLen);
        wcout << wszString << endl;  // debug output; cout would print only the pointer value
        int ansiLen = ::WideCharToMultiByte(CP_ACP, 0, wszString, -1, NULL, 0, NULL, NULL);  // wcslen(wszString)
        char* szAnsi = new char[ansiLen];
        ::WideCharToMultiByte(CP_ACP, 0, wszString, -1, szAnsi, ansiLen, NULL, NULL);
        delete[] wszString;
        return szAnsi;
    }

    void ChangeTextFromUtf8ToAnsi(const char* filename) {
        ifstream infile;
        string strLine = "";
        string strResult = "";
        infile.open(filename);
        infile.seekg(3, ios::beg);  // skip the 3-byte UTF-8 byte order mark (assumes the file starts with one)
        if (infile) {
            while (getline(infile, strLine)) {  // testing getline itself avoids appending an extra line at end of file
                strResult += strLine + "\n";
            }
        }
        infile.close();

        char* changeTemp = new char[strResult.length() + 1];
        changeTemp[strResult.length()] = '\0';  // see the problem record
        strcpy(changeTemp, strResult.c_str());  // copies the const char* into a writable char*
        char* changeResult = ChangeTxtEncoding(changeTemp);
        strResult = changeResult;

        ofstream outfile;
        outfile.open("Ansi.txt");
        outfile.write(strResult.c_str(), strResult.length());
        outfile.flush();
        outfile.close();
        delete[] changeResult;
        delete[] changeTemp;
    }
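ChangeTextFromUtf8ToAnsi skips three bytes unconditionally, which drops real content when the file has no byte order mark. A more defensive variant checks for the mark first. The sketch below is portable and works on data already read into memory; the name stripUtf8Bom is hypothetical and not part of the original program.

```cpp
#include <string>

// Hypothetical helper (not from the article): removes a leading UTF-8 byte
// order mark (EF BB BF) if present and returns other input unchanged.
std::string stripUtf8Bom(const std::string& data) {
    static const std::string bom = "\xEF\xBB\xBF";
    if (data.compare(0, bom.size(), bom) == 0) {
        return data.substr(bom.size());
    }
    return data;
}
```

The file could then be read from the beginning, without the seekg call, and passed through this helper before conversion.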

Problem record:
About the length of a string:
A. The length() and size() functions of the string type return the actual length of the string, excluding '\0';
B. The strlen() function for char* also returns the actual length of the string, excluding '\0';
C. Note that the sizeof operator counts the '\0'. For example, given char str[] = "hello"; sizeof(str) is 6.
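The three measurements can be compared side by side. This is an illustrative sketch; the struct Lengths and the function measureHello are hypothetical names introduced only for the example.

```cpp
#include <cstring>
#include <string>

// Illustrative only: the three ways of measuring "hello" discussed above.
struct Lengths {
    std::size_t viaStrlen;   // characters before '\0'
    std::size_t viaString;   // std::string::length()
    std::size_t viaSizeof;   // bytes in the array, including '\0'
};

Lengths measureHello() {
    char str[] = "hello";
    std::string s = str;
    return Lengths{std::strlen(str), s.length(), sizeof(str)};
}
```

Note that sizeof counts the terminator only when it is applied to an array. Applied to a char* pointer, it yields the size of the pointer, not the length of the string, which is why the buffers in the program above are sized with length() + 1.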
