Converting between ANSI, Unicode, and UTF-8 strings in C++ and writing them to files

Source: Internet
Author: User

Copyright notice: when reprinting, please use a hyperlink to indicate the original source, the author information of the article, and this statement.
Http://dark0729.blogbus.com/logs/51496111.html

The ANSI string is the one we are most familiar with: an English character occupies one byte, a Chinese character occupies two bytes, and the string ends with a single \0. It is commonly used in TXT text files.
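As a small illustration of these byte counts, the sketch below measures an ANSI string whose Chinese character is written as hex escapes. It assumes the ANSI code page is GBK (code page 936, the usual one on Simplified Chinese Windows), in which 你 is the two bytes 0xC4 0xE3. The names ansi_byte_length and kAnsiSample are made up for this example.

```cpp
#include <cstddef>
#include <cstring>

// Returns the number of bytes before the terminating '\0'.
// For an ANSI string this is a byte count, not a character count.
std::size_t ansi_byte_length(const char* s) {
    return std::strlen(s);
}

// "ab" followed by the GBK bytes of one Chinese character
// (assumption: 0xC4 0xE3 is 你 in code page 936).
// The literal is split so the hex escapes stay clearly separated from the text.
const char kAnsiSample[] = "ab" "\xC4\xE3";
```

Here ansi_byte_length(kAnsiSample) counts four bytes for three characters, and sizeof(kAnsiSample) is one larger because of the single terminating \0.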

The Unicode string (UTF-16 on Windows): each character, whether a Chinese character or an English letter, occupies two bytes (characters outside the Basic Multilingual Plane take four), and the string ends with two consecutive \0 bytes. This is the string type used by the NT operating system kernel. The character type is often defined as typedef unsigned short wchar_t; so when we see errors such as "char * cannot be converted to unsigned short *", the type being complained about is actually a Unicode string.
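The sizes can be checked with a short sketch. It uses char16_t rather than wchar_t as a portability assumption, because wchar_t is two bytes on Windows but usually four on Linux. The names utf16_byte_length and kWideSample are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Number of bytes a UTF-16 string occupies, excluding the terminator.
std::size_t utf16_byte_length(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}

// 'a', 'b' and U+4F60 (你): three 16-bit code units,
// followed by a 16-bit terminator, i.e. two zero bytes.
const char16_t kWideSample[] = u"ab\u4F60";
```

The three characters take six bytes, and sizeof(kWideSample) is eight because the terminator is itself two bytes wide. A character outside the Basic Multilingual Plane, such as U+1F600, is stored as a surrogate pair and takes four bytes.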

UTF-8 is a compressed form of Unicode. The English letter A is stored as 0x0041 in Unicode. For English text this wastes 50% of the space, so English characters are compressed to a single byte, which gives the UTF-8 encoding. Chinese characters, however, occupy three bytes in UTF-8, which is clearly less economical than the two bytes they occupy in ANSI or Unicode. This is why Chinese web pages commonly use ANSI encoding while foreign web pages use UTF-8.
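The byte counts follow directly from the UTF-8 encoding ranges. A minimal sketch, with the function name utf8_length chosen for this example:

```cpp
#include <cstddef>

// Number of bytes UTF-8 needs for one Unicode code point.
std::size_t utf8_length(char32_t cp) {
    if (cp < 0x80)    return 1;  // ASCII, e.g. 'A' (U+0041)
    if (cp < 0x800)   return 2;  // e.g. accented Latin letters
    if (cp < 0x10000) return 3;  // rest of the BMP, including common Chinese characters
    return 4;                    // supplementary planes
}
```

So utf8_length(U'A') is 1 where UTF-16 needs 2 bytes, while utf8_length(0x4F60) for 你 is 3 where UTF-16 and GBK need 2.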

UTF-8 is widely used in games, for example in WoW Lua scripts.


Next, let's look at the conversions, described mainly through code.

I used the CFile class for file writing; the same applies to FILE *. Writing a file has nothing to do with the type of the string: the hardware only cares about the data and its length.
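To make the "data and length" point concrete, here is a portable sketch using FILE * instead of CFile. The helper names write_bytes and file_size are hypothetical; the write function takes a pointer and a byte count and has no idea which encoding the bytes are in.

```cpp
#include <cstddef>
#include <cstdio>

// Writes `length` raw bytes to `path`; returns true on success.
// The function neither knows nor cares which encoding the bytes are in.
bool write_bytes(const char* path, const void* data, std::size_t length) {
    std::FILE* f = std::fopen(path, "wb");  // binary mode: no newline translation
    if (!f) return false;
    std::size_t written = std::fwrite(data, 1, length, f);
    return std::fclose(f) == 0 && written == length;
}

// Returns the size of the file in bytes, or -1 if it cannot be opened.
long file_size(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return -1;
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fclose(f);
    return size;
}
```

Passing a char buffer with its byte count or a wide buffer with count * sizeof(element) goes through exactly the same call; only the length argument differs.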


ANSI to Unicode

Two methods are introduced.

void CConvertDlg::OnBnClickedButtonAnsiToUnicode()
{
    // ANSI to Unicode
    const char* szAnsi = "abcd1234你我他";
    // Pre-convert to get the size of the required space
    int wcsLen = ::MultiByteToWideChar(CP_ACP, 0, szAnsi, strlen(szAnsi), NULL, 0);
    // Allocate space, leaving room for '\0'; MultiByteToWideChar does not count the '\0'
    wchar_t* wszString = new wchar_t[wcsLen + 1];
    // Convert
    ::MultiByteToWideChar(CP_ACP, 0, szAnsi, strlen(szAnsi), wszString, wcsLen);
    // Add the terminating '\0'
    wszString[wcsLen] = '\0';
    // Unicode version of the MessageBox API
    ::MessageBoxW(GetSafeHwnd(), wszString, wszString, MB_OK);

    // Next, write a text file. The first two bytes are the BOM 0xFEFF,
    // with the low byte 0xFF written first
    CFile cFile;
    cFile.Open(_T("1.txt"), CFile::modeWrite | CFile::modeCreate);
    // Go to the beginning of the file
    cFile.SeekToBegin();
    cFile.Write("\xff\xfe", 2);
    // Write the content
    cFile.Write(wszString, wcsLen * sizeof(wchar_t));
    cFile.Flush();
    cFile.Close();
    delete[] wszString;
    wszString = NULL;

    // Method 2
    // Set the current locale; without it, Chinese characters are not converted correctly
    // Requires #include <locale.h>
    setlocale(LC_CTYPE, "chs");
    wchar_t wcsStr[100];
    // Note the uppercase S below: in a Unicode format string it stands for an ANSI string
    // swprintf is the Unicode version of sprintf
    // The L prefix before the format string marks it as Unicode
    swprintf(wcsStr, 100, L"%S", szAnsi);
    ::MessageBoxW(GetSafeHwnd(), wcsStr, wcsStr, MB_OK);
}
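Writing the wchar_t buffer directly produces a valid file because Windows keeps UTF-16 in memory in little-endian order, which matches the FF FE byte order mark. The portable sketch below makes that byte order explicit instead of relying on the memory layout. The function name utf16le_with_bom is an assumption for this example, not part of MFC or the Windows API.

```cpp
#include <string>

// Serializes UTF-16 code units as little-endian bytes, preceded by the BOM FF FE.
std::string utf16le_with_bom(const std::u16string& text) {
    std::string bytes("\xFF\xFE", 2);  // U+FEFF with the low byte first
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // high byte
    }
    return bytes;
}
```

For the input u"A\u4F60" the result is the six bytes FF FE 41 00 60 4F, which is what a text editor expects at the start of a little-endian Unicode text file.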

Unicode to ANSI
There are also two methods.

void CConvertDlg::OnBnClickedButtonUnicodeToAnsi()
{
    // Unicode to ANSI
    const wchar_t* wszString = L"abcd1234你我他";
    // Pre-convert to get the size of the required space
    int ansiLen = ::WideCharToMultiByte(CP_ACP, 0, wszString, wcslen(wszString), NULL, 0, NULL, NULL);
    // As above, allocate space and leave room for '\0'
    char* szAnsi = new char[ansiLen + 1];
    // Convert
    // The Unicode version of strlen is wcslen
    ::WideCharToMultiByte(CP_ACP, 0, wszString, wcslen(wszString), szAnsi, ansiLen, NULL, NULL);
    // Add the terminating '\0'
    szAnsi[ansiLen] = '\0';
    // ANSI version of the MessageBox API
    ::MessageBoxA(GetSafeHwnd(), szAnsi, szAnsi, MB_OK);

    // Next, write a text file. An ANSI file has no BOM
    CFile cFile;
    cFile.Open(_T("1.txt"), CFile::modeWrite | CFile::modeCreate);
    // Go to the beginning of the file
    cFile.SeekToBegin();
    // Write the content
    cFile.Write(szAnsi, ansiLen * sizeof(char));
    cFile.Flush();
    cFile.Close();
    delete[] szAnsi;
    szAnsi = NULL;

    // Method 2
    // As above, there is another method
    setlocale(LC_CTYPE, "chs");
    char szStr[100];
    // Note the uppercase S below: in an ANSI format string it stands for a Unicode string
    sprintf(szStr, "%S", wszString);
    ::MessageBoxA(GetSafeHwnd(), szStr, szStr, MB_OK);
}
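The "measure first, then convert" pattern is not specific to the Windows API. As a hedged, portable sketch, the same two-pass idiom can be written with the standard std::wcsrtombs and a std::string in place of new/delete. The function name narrow_two_pass is made up for this example, and the result depends on the current C locale: in the default "C" locale only ASCII is guaranteed to convert, which is the same reason method 2 calls setlocale first.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Converts a wide string to the multibyte encoding of the current C locale.
// Returns an empty string if some character cannot be represented.
std::string narrow_two_pass(const std::wstring& wide) {
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide.c_str();
    // Pass 1: with a null destination, wcsrtombs only reports the number
    // of bytes needed, not counting the terminating '\0'.
    std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1) || len == 0) return std::string();

    // Pass 2: convert into a buffer of exactly that size.
    std::string out(len, '\0');
    src = wide.c_str();
    state = std::mbstate_t();
    std::wcsrtombs(&out[0], &src, len, &state);
    return out;
}
```

Using std::string for the buffer means there is no delete[] to forget and no manual terminator to append.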

Unicode to UTF-8

void CConvertDlg::OnBnClickedButtonUnicodeToU8()
{
    // Unicode to UTF-8
    const wchar_t* wszString = L"abcd1234你我他";
    // Pre-convert to get the size of the required space;
    // the function used this time has the reverse name of the one used for ANSI to Unicode
    int u8Len = ::WideCharToMultiByte(CP_UTF8, 0, wszString, wcslen(wszString), NULL, 0, NULL, NULL);
    // As above, allocate space and leave room for '\0'
    // Although UTF-8 is a compressed form of Unicode, it is also a multi-byte string,
    // so it can be stored as char
    char* szU8 = new char[u8Len + 1];
    // Convert
    // The Unicode version of strlen is wcslen
    ::WideCharToMultiByte(CP_UTF8, 0, wszString, wcslen(wszString), szU8, u8Len, NULL, NULL);
    // Add the terminating '\0'
    szU8[u8Len] = '\0';
    // MessageBox does not support UTF-8, so we can only write a file

    // Next, write a text file. The BOM of UTF-8 is 0xBFBBEF
    CFile cFile;
    cFile.Open(_T("1.txt"), CFile::modeWrite | CFile::modeCreate);
    // Go to the beginning of the file
    cFile.SeekToBegin();
    // Write the BOM, again with the low byte first
    cFile.Write("\xef\xbb\xbf", 3);
    // Write the content
    cFile.Write(szU8, u8Len * sizeof(char));
    cFile.Flush();
    cFile.Close();
    delete[] szU8;
    szU8 = NULL;
}
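To show what the CP_UTF8 conversion does to the data, here is a hand-written, portable sketch of UTF-16 to UTF-8 encoding with an optional BOM. It is an illustration under stated assumptions, not how WideCharToMultiByte is implemented: the names append_utf8 and utf16_to_utf8 are hypothetical, and unpaired surrogates are replaced with U+FFFD as a policy choice of this example.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Converts UTF-16 to UTF-8, optionally prefixed with the BOM EF BB BF.
// Unpaired surrogates are replaced with U+FFFD (a choice made for this sketch).
std::string utf16_to_utf8(const std::u16string& in, bool with_bom) {
    std::string out;
    if (with_bom) out += "\xEF\xBB\xBF";
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high and a low surrogate into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With with_bom set, the output for u"a\u4F60" is EF BB BF 61 E4 BD A0: the BOM, one byte for the English letter, and three bytes for the Chinese character.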

UTF-8 to Unicode

void CConvertDlg::OnBnClickedButtonU8ToUnicode()
{
    // UTF-8 to Unicode
    // The Chinese characters are written in hexadecimal form (the UTF-8 bytes of 你我他)
    const char* szU8 = "abcd1234\xe4\xbd\xa0\xe6\x88\x91\xe4\xbb\x96\x00";
    // Pre-convert to get the size of the required space
    int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szU8, strlen(szU8), NULL, 0);
    // Allocate space, leaving room for '\0'; MultiByteToWideChar does not count the '\0'
    wchar_t* wszString = new wchar_t[wcsLen + 1];
    // Convert
    ::MultiByteToWideChar(CP_UTF8, 0, szU8, strlen(szU8), wszString, wcsLen);
    // Add the terminating '\0'
    wszString[wcsLen] = '\0';
    // Unicode version of the MessageBox API
    ::MessageBoxW(GetSafeHwnd(), wszString, wszString, MB_OK);
    // Writing the text file is the same as in ANSI to Unicode
    delete[] wszString;
    wszString = NULL;
}
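The reverse direction can be sketched portably as well. The decoder below is a minimal illustration, not the Windows implementation: the name utf8_to_utf16 is hypothetical, malformed or truncated sequences become U+FFFD, and overlong encodings are not rejected, which a production decoder should do.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-16. Minimal sketch: malformed or truncated
// sequences are replaced with U+FFFD; overlong forms are not rejected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(u'\uFFFD'); ++i; continue; }

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;  // missing or invalid continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(u'\uFFFD');
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Fed the hex string from the function above, it yields the eight ASCII code units followed by U+4F60, U+6211 and U+4ED6, that is, 你我他.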

Converting ANSI to UTF-8 and UTF-8 to ANSI is a combination of the two approaches above: use Unicode as the intermediate form and convert twice.
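The two-step idea can be sketched portably if a simpler single-byte encoding stands in for the ANSI code page. The example below is an illustration under that assumption: it uses Latin-1 (ISO 8859-1), where each byte equals its Unicode code point, because a real ANSI code page such as GBK needs a large mapping table that on Windows is supplied by MultiByteToWideChar. All three function names are hypothetical.

```cpp
#include <string>

// Step 1 (stand-in for MultiByteToWideChar with an ANSI code page):
// Latin-1 maps each byte directly to the code point of the same value.
std::u16string latin1_to_utf16(const std::string& in) {
    std::u16string out;
    for (char c : in) out.push_back(static_cast<unsigned char>(c));
    return out;
}

// Step 2 (stand-in for WideCharToMultiByte with CP_UTF8).
// Only handles code points below U+0800, which covers everything step 1 produces.
std::string low_utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (char16_t u : in) {
        if (u < 0x80) {
            out.push_back(static_cast<char>(u));
        } else {
            out.push_back(static_cast<char>(0xC0 | (u >> 6)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        }
    }
    return out;
}

// The two steps chained, with UTF-16 as the intermediate form.
std::string latin1_to_utf8(const std::string& in) {
    return low_utf16_to_utf8(latin1_to_utf16(in));
}
```

On Windows the same shape applies with the real API calls: MultiByteToWideChar with CP_ACP for the first step and WideCharToMultiByte with CP_UTF8 for the second, or the reverse pair for UTF-8 to ANSI.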


