An exploration of fixing garbled text (involving UTF-8 and GBK)

Source: Internet
Author: User

When you use Visual Studio 2005 for MFC development, you may find that the automatically generated comments are garbled, like this:

// TODO: Ú′?ìí?óx¨ó?′ú?? Oí/?òμ÷ó?? Ùàà
and this:
// TODO: Ú′?ìí?ó??? ¢′|àí3ìdò′ú?? Oí/?òμ÷ó??? È?? μ

They should be displayed as

// TODO: Add private code here and/or call base class
and
// TODO: Add message handler code here and/or call default

When you save the file, a dialog box appears:

Various tutorials on the web suggest fixes, such as enabling the "Auto-detect UTF-8 encoding without signature" option. I decided to work out my own solution; here is my exploration process:

One, save the file. Save it with the encoding "Unicode (UTF-8 with signature) - Codepage 65001" ("UTF-8 with signature" means UTF-8 with a BOM), as shown:

Two, view the file's hexadecimal code (that is, see what data the file actually contains)

Open the file you just saved in WinHex (UltraEdit works too) to view its hexadecimal code. Locate the garbled text and its hexadecimal code, as follows:

To demonstrate more clearly, I copied the garbled text out separately into its own file. Note that the text must be saved as UTF-8 (preferably with a BOM; the text editor that ships with Windows adds a BOM itself). It was saved like this:

The hexadecimal code of the file is:

The first three bytes, "EF BB BF", are the BOM described earlier; the actual content of the file starts at the fourth byte. In the content, apart from the ASCII '/', every byte at an odd position is either C2 or C3, while the bytes at even positions show no obvious pattern. For comparison, we also look up the GBK encoded values of the correct text, "Add private code here and/or call base class".
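The BOM check described above can be sketched in C as follows. This is only an illustration; the function name has_utf8_bom is hypothetical and does not come from any library.

```c
#include <stddef.h>

// Returns 1 if the buffer starts with the UTF-8 BOM (EF BB BF), else 0.
// has_utf8_bom is an illustrative name, not a library function.
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}
```

When the function returns 1, the actual content starts at buf + 3, which matches the "fourth byte" mentioned above.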

Table One: hexadecimal data in the garbled file, one row per character (a Chinese character or '/'):

0xc3    0x94    0xc3    0x9a
0xc2    0xb4    0xc3    0x8b
0xc3    0x8c    0xc3    0xad
0xc2    0xbc    0xc3    0x93
0xc3    0x97    0xc2    0xa8
0xc3    0x93    0xc3    0x83
0xc2    0xb4    0xc3    0xba
0xc3    0x82    0xc3    0xab
0xc2    0xba    0xc3    0x8d
0x2f
0xc2    0xbb    0xc3    0xb2
0xc2    0xb5    0xc3    0xb7
0xc3    0x93    0xc3    0x83
0xc2    0xbb    0xc3    0xb9
0xc3    0x80    0xc3    0xa0

Table Two: GBK encoded values of "Add private code here and/or call base class", one row per character (a Chinese character or '/'):

0xd4    0xda
0xb4    0xcb
0xcc    0xed
0xbc    0xd3
0xd7    0xa8
0xd3    0xc3
0xb4    0xfa
0xc2    0xeb
0xba    0xcd
0x2f
0xbb    0xf2
0xb5    0xf7
0xd3    0xc3
0xbb    0xf9
0xc0    0xe0

4, analysis

Looking closely at the two tables above, the following rules are easy to spot:

1, if the C2 and C3 values at the odd positions in each row of Table One are removed (the '/' row has none), the remaining values are similar to the data in Table Two.

2, apart from the "/" row: where the odd-position byte in a row of Table One is C2, the even-position byte that follows is identical to the corresponding byte in Table Two (the second column of Table One corresponds to the first column of Table Two, and the fourth column of Table One to the second column of Table Two); where the odd-position byte is C3, the even-position byte that follows, plus hexadecimal 0x40, equals the corresponding byte in Table Two (with the same column correspondence).

3, the garbled file is stored as UTF-8, yet the conversion above yields GBK encoded values, so the cause of the garbling is most likely that Visual Studio 2005 mixed up the two encodings. This is arguably a bug; I have never run into it in Visual Studio 2013.
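One way to read rules 1 and 2: the C2/C3 pattern is exactly what UTF-8 produces when each GBK byte in the range 0x80 to 0xFF is treated as a single character whose code point equals the byte value (as Latin-1 does) and is then encoded as UTF-8. That this is what Visual Studio 2005 does internally is an assumption, not something confirmed here. The sketch below shows the forward direction; latin1_byte_to_utf8 is a hypothetical name.

```c
#include <stddef.h>

// Encode one byte, read as the code point of the same value (as Latin-1
// would), into UTF-8. Writes 1 or 2 bytes to out and returns the count.
// latin1_byte_to_utf8 is an illustrative name, not a library function.
size_t latin1_byte_to_utf8(unsigned char b, unsigned char out[2])
{
    if (b < 0x80) {
        out[0] = b;                               // ASCII stays a single byte
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (b >> 6));    // 0xC2 for 0x80-0xBF, 0xC3 for 0xC0-0xFF
    out[1] = (unsigned char)(0x80 | (b & 0x3F));  // same value, or value minus 0x40
    return 2;
}
```

Feeding it the GBK bytes 0xd4 and 0xda from the first row of Table Two gives c3 94 c3 9a, the first row of Table One. If the bytes were instead read as Windows-1252, values in 0x80 to 0x9F would map to different code points; no such byte occurs in the sample.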

5, solve the problem

According to the rules above, we can fix the problem by reading the UTF-8 encoded file in binary mode and writing the recovered bytes to a GBK encoded file.

A simple C program based on these rules:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char const *argv[])
{
    FILE *fp;   // destination file (GBK)
    FILE *fp2;  // source file (garbled UTF-8)
    int ch;     // byte at the odd (high) position; int so that EOF fits
    int cl;     // byte at the even (low) position

    // Open the garbled UTF-8 file in binary mode
    if ((fp2 = fopen("BadCode.txt", "rb")) == NULL) {
        printf("Open source file failed!\n");
        system("pause");
        exit(1);
    }
    // Create the file that receives the processed data; binary mode
    // keeps the C runtime from rewriting line endings on Windows
    if ((fp = fopen("BadCodeH.txt", "wb")) == NULL) {
        printf("Open/create destination file failed!\n");
        system("pause");
        exit(1);
    }

    ch = fgetc(fp2);
    // Skip the BOM: EF BB BF for UTF-8, FF FE for UTF-16 ("Unicode").
    // The byte mapping below is only meaningful for UTF-8 input.
    if (ch == 0xEF) {
        fgetc(fp2);
        fgetc(fp2);
        ch = fgetc(fp2);
    } else if (ch == 0xFF) {
        fgetc(fp2);
        ch = fgetc(fp2);
    }

    // Process until end of file
    while (ch != EOF) {
        if (ch <= 0x7f) {
            // ASCII character: output unchanged
            fputc(ch, fp);
        } else if (ch == 0xc3) {
            // Odd byte is 0xc3: output the even byte plus 0x40
            cl = fgetc(fp2);
            if (cl == EOF) break;
            fputc(cl + 0x40, fp);
        } else if (ch == 0xc2) {
            // Odd byte is 0xc2: output the even byte unchanged
            cl = fgetc(fp2);
            if (cl == EOF) break;
            fputc(cl, fp);
        } else {
            // Anything else: output unchanged
            fputc(ch, fp);
        }
        // Get the next byte
        ch = fgetc(fp2);
    }

    fclose(fp);
    fclose(fp2);
    system("pause");
    return 0;
}

A sample run gives the following result:

6, the more general situation (a file with both correct Chinese characters and garbled text)

Note that the C program above only works in one case: the garbled file is in UTF-8 format and contains only garbled Chinese text and ASCII characters. Most of the time, however, source code contains both correctly encoded Chinese characters and garbled text, and then the program above fails, because the correctly encoded Chinese characters also need to be converted from UTF-8 to GBK. We therefore modified the code above to handle this case.

For a file with both garbled and normal characters, converting the correct Chinese characters from UTF-8 to GBK as well solves the problem, so the key is to build a conversion table between UTF-8 and GBK codes. Such a table is easy to find with a Baidu search. I then wrote a program that uses it; the UnicodeToGBK.txt file is at: http://pan.baidu.com/s/1gdCkWSj
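The author's program is not reproduced here. As a rough sketch of the approach just described, the C code below handles a buffer that mixes garbled text (C2/C3 pairs) with correctly encoded Chinese characters (3-byte UTF-8 sequences). Everything in it is an assumption made for illustration: the names map_entry, sample_table, lookup_gbk and mixed_utf8_to_gbk are hypothetical, and the three-entry table merely stands in for the full mapping, since the layout of UnicodeToGBK.txt is not described.

```c
#include <stddef.h>

// Tiny stand-in for the Unicode-to-GBK table (hypothetical; a real run
// would load the full mapping). Entries must stay sorted by code point.
struct map_entry { unsigned int unicode; unsigned short gbk; };

static const struct map_entry sample_table[] = {
    { 0x548C, 0xBACD },  // 和
    { 0x5728, 0xD4DA },  // 在
    { 0x7528, 0xD3C3 },  // 用
};

// Binary search in sample_table; returns the GBK code, or 0 if absent.
static unsigned short lookup_gbk(unsigned int cp)
{
    size_t lo = 0, hi = sizeof sample_table / sizeof sample_table[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sample_table[mid].unicode == cp) return sample_table[mid].gbk;
        if (sample_table[mid].unicode < cp) lo = mid + 1;
        else hi = mid;
    }
    return 0;
}

// Convert a mixed UTF-8 buffer to GBK. Returns the number of bytes
// written, or (size_t)-1 on malformed input, an unmapped character,
// or an output buffer that is too small.
size_t mixed_utf8_to_gbk(const unsigned char *in, size_t n,
                         unsigned char *out, size_t cap)
{
    size_t i = 0, o = 0;
    while (i < n) {
        unsigned char b = in[i];
        if (b < 0x80) {
            // ASCII: copy unchanged
            if (o + 1 > cap) return (size_t)-1;
            out[o++] = b;
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < n
                   && (in[i + 1] & 0xC0) == 0x80) {
            // Garbled pair: recover one original GBK byte
            if (o + 1 > cap) return (size_t)-1;
            out[o++] = (unsigned char)(in[i + 1] + (b == 0xC3 ? 0x40 : 0));
            i += 2;
        } else if ((b & 0xF0) == 0xE0 && i + 2 < n
                   && (in[i + 1] & 0xC0) == 0x80
                   && (in[i + 2] & 0xC0) == 0x80) {
            // Correct 3-byte UTF-8: decode the code point, then look it up
            unsigned int cp = ((unsigned int)(b & 0x0F) << 12)
                            | ((unsigned int)(in[i + 1] & 0x3F) << 6)
                            | (unsigned int)(in[i + 2] & 0x3F);
            unsigned short g = lookup_gbk(cp);
            if (g == 0 || o + 2 > cap) return (size_t)-1;
            out[o++] = (unsigned char)(g >> 8);
            out[o++] = (unsigned char)(g & 0xFF);
            i += 3;
        } else {
            return (size_t)-1;
        }
    }
    return o;
}
```

One limitation of this sketch: a legitimate UTF-8 character whose lead byte is C2 or C3 (for example an accented Latin letter) cannot be told apart from a garbled pair, so it is treated as garbled text.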
