When doing MFC development in Visual Studio 2005, you may find that the comments the IDE inserts automatically come out garbled, like this:
// TODO: Ú′?ìí?óx¨ó?′ú?? Oí/?òμ÷ó?? Ùàà
And this:
// TODO: Ú′?ìí?ó??? ¢′|àí3ìdò′ú?? Oí/?òμ÷ó??? È?? μ
They should read:
// TODO: Add your specialized code here and/or call the base class
and
// TODO: Add your message handler code here and/or call default
When saved, a dialog box appears:
There are plenty of tutorials on the web, including ones that have you enable "Auto-detect UTF-8 encoding without signature", so I decided to work out my own solution. Here is how the exploration went:
1, Save the file first, choosing the encoding "Unicode (UTF-8 with signature) - Codepage 65001" ("with signature" means UTF-8 with a BOM). Such as:
2, View the file's hexadecimal content (that is, see which bytes the file actually stores)
Open the file you just saved with WinHex (UltraEdit works too) and look at its hexadecimal content. Find the garbled text and the bytes that make it up, as follows:
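If no hex editor is at hand, a few lines of C can produce the same view. The following is only a minimal sketch, not part of the original workflow; the names format_hex and dump_file are made up for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Format the bytes of `data` as space-separated uppercase hex pairs
   ("EF BB BF ...") into `out`. Stops early if `out` would overflow.
   Returns the number of bytes actually formatted. */
size_t format_hex(const unsigned char *data, size_t len,
                  char *out, size_t out_size)
{
    size_t i, pos = 0;
    if (out_size == 0) return 0;
    out[0] = '\0';
    /* each step needs at most 3 characters plus the terminating NUL */
    for (i = 0; i < len && pos + 4 <= out_size; i++) {
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                i ? " %02X" : "%02X", (unsigned)data[i]);
    }
    return i;
}

/* Print a whole file as hex, 16 bytes per row. Returns 0 on success. */
int dump_file(const char *path)
{
    unsigned char buf[16];
    char line[16 * 3 + 1];
    size_t n;
    FILE *fp = fopen(path, "rb");   /* binary mode: bytes are read untouched */
    if (fp == NULL) return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        format_hex(buf, n, line, sizeof line);
        puts(line);
    }
    fclose(fp);
    return 0;
}
```

Calling dump_file("BadCode.txt") on a file saved with a BOM should show a first row that starts with EF BB BF.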
To make the demonstration clearer, I copied the garbled text into a separate file. Note that this text file must be saved as UTF-8 (preferably with a BOM; the text editor that ships with Windows adds the BOM by itself). Saved, it looks like this:
The hexadecimal content of the file is:
The first three bytes, "EF BB BF", are the BOM mentioned earlier; the actual content starts at the fourth byte. In the content part, every byte in an odd position is either C2 or C3, while the bytes in even positions show no such pattern. For comparison, we also look up the GBK encoded values of the correct text, "Add your specialized code here and/or call the base class" (在此添加专用代码和/或调用基类).
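As an illustration of the BOM check, here is a small sketch in C. The function name bom_length is hypothetical, and the UTF-16 cases (FF FE and FE FF) are included only because they are the other BOMs a Windows editor commonly writes.

```c
#include <stddef.h>

/* Return the length in bytes of a byte-order mark at the start of `buf`:
   3 for UTF-8 (EF BB BF), 2 for UTF-16 (FF FE or FE FF), 0 if there is none. */
size_t bom_length(const unsigned char *buf, size_t n)
{
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    if (n >= 2 && ((buf[0] == 0xFF && buf[1] == 0xFE) ||
                   (buf[0] == 0xFE && buf[1] == 0xFF)))
        return 2;
    return 0;
}
```

Checking all three bytes, rather than only the first, avoids mistaking a file that merely begins with a character whose first byte is EF for one that carries a BOM.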
Table One: hexadecimal data in the garbled file, one original character (a Chinese character or "/") per row:
0xc3 0x94 0xc3 0x9a
0xc2 0xb4 0xc3 0x8b
0xc3 0x8c 0xc3 0xad
0xc2 0xbc 0xc3 0x93
0xc3 0x97 0xc2 0xa8
0xc3 0x93 0xc3 0x83
0xc2 0xb4 0xc3 0xba
0xc3 0x82 0xc3 0xab
0xc2 0xba 0xc3 0x8d
0x2f
0xc2 0xbb 0xc3 0xb2
0xc2 0xb5 0xc3 0xb7
0xc3 0x93 0xc3 0x83
0xc2 0xbb 0xc3 0xb9
0xc3 0x80 0xc3 0xa0
Table Two: GBK encoded values of "Add your specialized code here and/or call the base class" (在此添加专用代码和/或调用基类), one character (a Chinese character or "/") per row:
0xd4 0xda
0xb4 0xcb
0xcc 0xed
0xbc 0xd3
0xd7 0xa8
0xd3 0xc3
0xb4 0xfa
0xc2 0xeb
0xba 0xcd
0x2f
0xbb 0xf2
0xb5 0xf7
0xd3 0xc3
0xbb 0xf9
0xc0 0xe0
4, Analysis
Looking closely at the data in the two tables, the following rules are not hard to spot:
1, If the C2 and C3 values in the odd positions of each row of table one are removed (the "/" row aside), the remaining values are similar to the data in table two.
2, The "/" row aside: where the odd-position byte in a row of table one is C2, the even-position byte after it equals the corresponding byte in table two (column 2 of table one corresponds to column 1 of table two, and column 4 of table one to column 2 of table two). Where the odd-position byte is C3, the even-position byte after it plus hexadecimal 0x40 equals the corresponding byte in table two (same correspondence as before).
3, The garbled file is stored as UTF-8, yet the conversion yields GBK encoded values, so we can roughly conclude that the garbling happens because Visual Studio 2005 mixed up the two encodings. This should count as a bug. After all, I have never run into it in Visual Studio 2013.
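The "plus 0x40" rule is what ordinary UTF-8 decoding of a two-byte sequence produces: the lead bytes C2 and C3 cover the code points U+0080 to U+00FF, so decoding each pair gives back one byte value. That is consistent with each GBK byte having been treated as a single-byte character and then encoded as UTF-8, though the source does not confirm what Visual Studio does internally. A minimal sketch, with hypothetical function names:

```c
#include <stddef.h>

/* Decode the two-byte UTF-8 sequence (lead, trail) to its code point.
   For lead 0xC2 the result equals trail; for lead 0xC3 it equals trail + 0x40. */
unsigned decode_two_byte(unsigned char lead, unsigned char trail)
{
    return ((unsigned)(lead & 0x1F) << 6) | (unsigned)(trail & 0x3F);
}

/* Undo the garbling for a buffer assumed to hold only ASCII and C2/C3 pairs.
   Writes the recovered bytes to `out` (which must hold at least `n` bytes)
   and returns how many bytes were written. */
size_t recover_bytes(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, w = 0;
    while (i < n) {
        if ((in[i] == 0xC2 || in[i] == 0xC3) && i + 1 < n) {
            out[w++] = (unsigned char)decode_two_byte(in[i], in[i + 1]);
            i += 2;
        } else {
            out[w++] = in[i++];   /* ASCII and anything else: copy unchanged */
        }
    }
    return w;
}
```

Fed the bytes C3 94 C3 9A, recover_bytes yields D4 DA, the GBK value of the first character of the comment.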
5, Solving the problem
Following the rules above, we can fix the problem by reading the UTF-8 encoded file in binary mode and writing the recovered bytes out as a GBK encoded file.
A simple C program written according to these rules:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char const *argv[])
{
    FILE *fp;
    FILE *fp2;
    // Odd-position (high) byte; an int, so that EOF can be told apart from data
    int ch;
    // Even-position (low) byte
    int cl;

    // Open the file that holds the garbled text (UTF-8) in binary mode
    if ((fp2 = fopen("BadCode.txt", "rb")) == NULL) {
        printf("Open source file failed!\n");
        system("pause");
        exit(1);
    }
    // Open a new file to store the processed data; binary mode as well,
    // so that line endings are not translated a second time on Windows
    if ((fp = fopen("BadCodeH.txt", "wb")) == NULL) {
        printf("Open/create destination file failed!\n");
        fclose(fp2);
        system("pause");
        exit(1);
    }

    // Get the first byte
    ch = fgetc(fp2);
    // Determine the file format, UTF-8 or Unicode, and skip the BOM
    if (ch == 0xEF) {
        fgetc(fp2);
        fgetc(fp2);
        ch = fgetc(fp2);
    } else if (ch == 0xFF) {
        fgetc(fp2);
        ch = fgetc(fp2);
    }

    // Until the end of the file
    while (ch != EOF) {
        if (ch <= 0x7f) {
            // ASCII character: output unchanged
            fputc(ch, fp);
        } else if (ch == 0xc3) {
            // Odd-position byte is 0xc3: output the even-position byte plus 0x40
            cl = fgetc(fp2);
            if (cl == EOF)
                break;
            fputc(cl + 0x40, fp);
        } else if (ch == 0xc2) {
            // Odd-position byte is 0xc2: output the even-position byte unchanged
            cl = fgetc(fp2);
            if (cl == EOF)
                break;
            fputc(cl, fp);
        } else {
            // Other cases: output unchanged
            fputc(ch, fp);
        }
        // Get the next byte
        ch = fgetc(fp2);
    }

    fclose(fp);
    fclose(fp2);
    system("pause");
    return 0;
}
The result of a sample run:
6, A more general situation (both correct Chinese characters and garbled ones)
Note that the C program above only suits one case: the garbled file is in UTF-8 format and contains nothing but garbled Chinese characters and ASCII characters. Most of the time, though, our source code contains both correct Chinese characters and garbled ones, and then the program above fails, because the correct Chinese characters also have to be converted from their UTF-8 encoding to GBK. So we modify the code above to handle this case.
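One way to tell the two kinds of characters apart is the UTF-8 lead byte. The sketch below is not the author's program; it assumes that garbled bytes always appear as two-byte sequences led by C2 or C3, that correctly stored Chinese characters are three-byte sequences, and that the file holds no legitimate characters in the range U+0080 to U+00FF. The function names are hypothetical.

```c
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` cannot start a sequence (for example a continuation byte). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;             /* ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx: C2/C3 pairs land here */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx: typical Chinese character */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Decode a three-byte UTF-8 sequence to its Unicode code point.
   The code point is what a Unicode-to-GBK table would then be looked up with;
   that lookup is left out here because the table's file format is not known. */
unsigned decode_three_byte(const unsigned char *p)
{
    return ((unsigned)(p[0] & 0x0F) << 12) |
           ((unsigned)(p[1] & 0x3F) << 6)  |
            (unsigned)(p[2] & 0x3F);
}
```

For instance, a correctly stored 在 is the three bytes E5 9C A8, which decode to U+5728; the conversion table then has to map that code point to the GBK bytes D4 DA.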
For a file that contains both garbled and normal characters, the problem is solved once the correct Chinese characters are converted from UTF-8 to GBK, so the key is to build a conversion table between UTF-8 and GBK. Such a table is easy to find with a Baidu search. I then wrote the following program, in which the UnicodeToGBK.txt file is: http://pan.baidu.com/s/1gdCkWSj:
An exploration of fixing garbled text, part one (involving UTF-8 and GBK)