Multibyte and wide characters
string/char* and wstring/wchar_t* in C++
C++ test
On Windows
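const char* cName = "北京市";
// Convert the multibyte (ANSI code page) string to a wide-character string.
wchar_t wsName[50] = {0};
// First call: ask for the required length in wide characters; -1 drops the terminating null.
int wideCharCount = MultiByteToWideChar(CP_ACP, 0, cName, -1, NULL, 0) - 1;
// Second call: perform the conversion, including the terminating null.
MultiByteToWideChar(CP_ACP, 0, cName, -1, wsName, wideCharCount + 1);
for (int i = 0; i < wideCharCount; i++) {
    printf("%d ", wsName[i]);
}
printf("\n");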
Output
21271 20140 24066
On Linux
The test code is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <iostream>
#include <string>
using namespace std;

void multibyte_to_widechar_test();
void read_file(const char* fname);
void dump_uchar(unsigned char ch);

int main()
{
    multibyte_to_widechar_test();
    read_file("CHS");
    printf("any key pressed to exit...\n");
    getchar();
    return 0;
}

void multibyte_to_widechar_test()
{
    typedef string str_t;
    // Remember the current locale so it can be restored at the end.
    str_t cur_loc = setlocale(LC_ALL, NULL);
    printf("cur_locale = %s\n", cur_loc.c_str());
    setlocale(LC_ALL, "zh_CN.GBK");

    char mb_buf[100];
    strcpy(mb_buf, "北京市");
    int mbstr_len = (int)strlen(mb_buf);

    // With a NULL destination, mbstowcs only counts the wide characters needed.
    wchar_t* wcstr = NULL;
    int wcstr_len = (int)mbstowcs(wcstr, mb_buf, 0) + 1;
    printf("mb_len = %d, wc_len = %d\n", mbstr_len, wcstr_len);

    wcstr = new wchar_t[wcstr_len];
    // The third argument is the capacity of wcstr in wide characters.
    int ret = (int)mbstowcs(wcstr, mb_buf, wcstr_len);
    if (ret <= 0) {
        printf("conversion failed\n");
    } else {
        printf("conversion succeeded\n");
        // stdout is already byte-oriented after printf, so this wide-character
        // call may print nothing (the output below shows no line here).
        wprintf(L"%ls\n", wcstr);

        printf("view1 =====\n");
        for (int i = 0; i < wcstr_len - 1; i++) {
            int code = (int)wcstr[i];
            printf("%d\t", code);
        }
        printf("\n");

        printf("view2 =====\n");
        for (int i = 0; i < wcstr_len - 1; i++) {
            int code = (int)wcstr[i];
            dump_uchar((unsigned char)(code / 256));  // high byte
            dump_uchar((unsigned char)(code % 256));  // low byte
        }
        printf("\n");
    }
    delete[] wcstr;
    setlocale(LC_ALL, cur_loc.c_str());
}

// Print one byte as two hexadecimal digits.
void dump_uchar(unsigned char ch)
{
    const char* str = "0123456789abcdef";
    printf("0x%c%c\t", str[ch / 16], str[ch % 16]);
}

void read_file(const char* fname)
{
    FILE* fp = fopen(fname, "r");
    if (!fp) {
        return;
    }
    printf("===============\n");
    char buffer[100] = {0};
    fgets(buffer, sizeof(buffer), fp);
    printf("%s", buffer);

    printf("view1 ===========\n");
    int len = (int)strlen(buffer) - 1;  // skip the trailing newline
    for (int i = 0; i < len; i++) {
        dump_uchar((unsigned char)buffer[i]);
    }
    printf("\n");

    printf("view2 ===========\n");
    for (int i = 0; i < len; i += 2) {
        unsigned char down = (unsigned char)buffer[i];
        unsigned char high = (unsigned char)buffer[i + 1];
        printf("%d ", (high << 8) | down);
    }
    printf("\n");
    fclose(fp);
}

The multibyte_to_widechar_test function converts the multibyte (GBK) string into Unicode and then prints the resulting code values. read_file reads a line from the file and dumps the encoded bytes of the string it contains.
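As a cross-check of what the conversion should produce, here is a minimal Python 3 sketch of the same idea. It is not the author's code: it assumes the input is the GBK byte sequence of 北京市 (the bytes the Python session in this article shows), and the helper names to_code_points and split_bytes are made up for illustration.

```python
def to_code_points(mb_bytes, encoding="gbk"):
    """Decode a multibyte string and return one Unicode code point per character."""
    return [ord(ch) for ch in mb_bytes.decode(encoding)]


def split_bytes(code):
    """Split a 16-bit code value into (high byte, low byte), like code/256 and code%256."""
    return divmod(code, 256)


gbk_bytes = b"\xb1\xb1\xbe\xa9\xca\xd0"  # assumed input: GBK bytes of 北京市
codes = to_code_points(gbk_bytes)

# Multibyte length, and wide length including the terminating null.
print(len(gbk_bytes), len(codes) + 1)
# Same values as "view1" in the C++ test.
print(codes)
# Same byte split as "view2" in the C++ test.
print(" ".join("0x%02x 0x%02x" % split_bytes(c) for c in codes))
```

Run as is, this prints 6 4, then [21271, 20140, 24066], then 0x53 0x17 0x4e 0xac 0x5e 0x02.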
The file CHS was created directly in vi with the content "北京市". The .bash_profile contains the following setting:
export LC_ALL="zh_CN.GBK"
So the CHS file is encoded as GBK by default.
Compile with g++ test.cpp -o app_test, then run it. The output:
~/peteryfren/cpp/encode_app> ./app_test
cur_locale = C
mb_len = 6, wc_len = 4
conversion succeeded
view1 =====
21271 20140 24066
view2 =====
0x53 0x17 0x4e 0xac 0x5e 0x02
===============
北京市
view1 ===========
0xb1 0xb1 0xbe 0xa9 0xca 0xd0
view2 ===========
45489 43454 53450
any key pressed to exit...
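The Unicode code values of "北京市" (21271, 20140, 24066) are the same as the output on Windows. The GBK (GB2312) values of "北京市" print as 45489, 43454, 53450; the view2 loop of the file dump builds each value with the second byte of the pair as the high byte. The output also confirms that the file created with vi on Linux is GBK encoded, consistent with the setting in .bash_profile.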
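The decimal values in the file dump depend on which byte of each pair is treated as the high byte. The following Python 3 sketch makes that explicit; the function name pair_values is hypothetical, and the input is assumed to be the six bytes shown in view1 of the file dump.

```python
def pair_values(data, byteorder):
    """Combine each 2-byte pair of data into one integer using the given byte order."""
    return [int.from_bytes(data[i:i + 2], byteorder) for i in range(0, len(data), 2)]


gbk_bytes = b"\xb1\xb1\xbe\xa9\xca\xd0"  # bytes from view1 of the file dump

# Second byte as the high byte: the values printed by the C++ test.
print(pair_values(gbk_bytes, "little"))
# First (lead) byte as the high byte: how GBK tables usually list the codes.
print([hex(v) for v in pair_values(gbk_bytes, "big")])
```

The first line prints [45489, 43454, 53450]; the second prints ['0xb1b1', '0xbea9', '0xcad0'], the form to use when looking the characters up in a GBK code table.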
BTW, on Linux a UTF-8 encoded file can be converted to GBK with iconv:
iconv -f UTF-8 -t GBK test.txt -o pp.txt
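The iconv command transcodes from UTF-8 to GBK by decoding to Unicode and encoding again. A minimal Python 3 sketch of that round trip, working on bytes in memory rather than on files; the function name transcode is made up for illustration.

```python
def transcode(data, src="utf-8", dst="gbk"):
    """Decode bytes from the source encoding, then encode them in the target encoding."""
    return data.decode(src).encode(dst)


utf8_bytes = b"\xe5\x8c\x97\xe4\xba\xac\xe5\xb8\x82"  # UTF-8 bytes of 北京市
gbk_bytes = transcode(utf8_bytes)
print(gbk_bytes)
# Converting back should give the original bytes.
print(transcode(gbk_bytes, src="gbk", dst="utf-8") == utf8_bytes)
```

This prints b'\xb1\xb1\xbe\xa9\xca\xd0' and then True, matching the GBK bytes dumped from the CHS file.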
Python 2.7 test
>>> s = u'北京市'
>>> s
u'\u5317\u4eac\u5e02'
>>> gbks = '北京市'
>>> gbks
'\xb1\xb1\xbe\xa9\xca\xd0'
>>> s.encode('utf-8')
'\xe5\x8c\x97\xe4\xba\xac\xe5\xb8\x82'
In Python 2.7, a literal with the u prefix is a Unicode string; without the prefix it is a byte string in the terminal's encoding (GBK here). In Python 3.3, the interactive prompt no longer shows the escaped codes of the string: >>> s displays the characters themselves, much like >>> print(s).
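In Python 3 the same values can still be inspected, but they have to be requested explicitly. A short sketch, assuming Python 3 with the standard gbk codec available; the string is written with escapes so the example does not depend on the source file's encoding.

```python
s = "\u5317\u4eac\u5e02"  # 北京市

# ascii() gives the repr with non-ASCII characters escaped, like the Python 2 u'' repr.
print(ascii(s))
# In Python 3 text and bytes are separate types; encode() produces the byte string.
print(s.encode("gbk"))
print(s.encode("utf-8"))
```

The three lines printed are '\u5317\u4eac\u5e02', b'\xb1\xb1\xbe\xa9\xca\xd0' and b'\xe5\x8c\x97\xe4\xba\xac\xe5\xb8\x82', the same values as in the Python 2.7 session.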
Windows text encoding verification
1. ANSI: create a text file with the default settings in the Notepad that ships with Windows, name it Npd.txt, and open it with UE in hex view.
Here the Chinese text in the file is GBK (GB2312) encoded, consistent with the file bytes dumped on Linux.
2. Unicode: open Npd.txt in Notepad and choose Save As. The dialog shows the current encoding as ANSI; select Unicode and save as Npd_u.txt.
The Unicode code values match the output on Windows and Linux above.
3. UTF-8: likewise open Npd.txt, choose Save As, select UTF-8 as the encoding, and save as Npd_utf8.txt.
The UTF-8 bytes are consistent with the Python experiment, as expected.
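The bytes a hex viewer should show for the three files can be sketched in Python 3. This rests on assumptions not stated in the article: that ANSI on a Simplified Chinese Windows means code page 936, that Notepad's Unicode option writes UTF-16 little-endian with a byte order mark, and that its UTF-8 option also writes a byte order mark (classic Notepad behaviour; newer versions may differ). The helper names hexdump and notepad_bytes are hypothetical.

```python
import codecs


def hexdump(data):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join("%02x" % b for b in data)


def notepad_bytes(text):
    """Return the assumed file contents for each Notepad encoding option."""
    return {
        "ANSI": text.encode("cp936"),
        "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
        "UTF-8": codecs.BOM_UTF8 + text.encode("utf-8"),
    }


for name, data in notepad_bytes("\u5317\u4eac\u5e02").items():  # 北京市
    print("%-8s %s" % (name, hexdump(data)))
```

Under those assumptions the ANSI file is b1 b1 be a9 ca d0, the Unicode file is ff fe 17 53 ac 4e 02 5e, and the UTF-8 file is ef bb bf e5 8c 97 e4 ba ac e5 b8 82.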
A study of string encoding problems: http://blog.csdn.net/ryfdizuo/article/details/17324051
GB18030 and the commonly used GBK are both extensions of GB2312; every character already in GB2312 keeps the same encoding.
References
1. http://blog.csdn.net/xiaobai1593/article/details/7063535
2. GBK/GB2312 encoding table: http://ff.163.com/newflyff/gbk-list/
3. Unicode encoding table: http://jlqzs.blog.163.com/blog/static/2125298320070101826277/
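The note above that GBK and GB18030 extend GB2312 without changing existing codes can be checked with Python 3's standard codecs. This is only a spot check on a few characters, not a proof; the helper name try_encode is made up, and 龍 (U+9F8D) is chosen here as an example of a character outside GB2312.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


text = "\u5317\u4eac\u5e02"  # 北京市, all three characters are in GB2312
same = text.encode("gb2312") == text.encode("gbk") == text.encode("gb18030")
print(same)

extra = "\u9f8d"  # 龍, a traditional character that GB2312 does not contain
print(try_encode(extra, "gb2312"))
print(try_encode(extra, "gbk") == try_encode(extra, "gb18030"))
```

The first check shows the three encodings agree on the GB2312 characters; the last two show a character that GB2312 rejects while GBK and GB18030 encode it identically.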