Thinking Logic of Computer Programs (7) - How to Recover from Garbled Characters (Part 2)

Last Update:2017-01-06 Source: Internet

Author: User

Tags windows 1252 ultraedit


Garbled

The previous section explained the main cause of garbled text: during an encoding conversion, if the original encoding is identified incorrectly and the conversion is carried out anyway, the result is garbled, and from then on switching the encoding used to view the file no longer helps.

Let's look at what the text looks like after such a wrong conversion, continuing with the example from the previous section. The binary (in hexadecimal) is C3 80 C3 8F C3 82 C3 AD, and it appears garbled no matter which encoding is used to parse it:

UTF-8	ÀÏÂí
Windows-1252	Ã€Ã?Ã‚Ã­
GB18030	脌脧脗铆
Big5	???穩

Although the garbled text can take many forms, the one we are most likely to see is "ÀÏÂí", because in this example UTF-8 is the target encoding of the conversion, and a file that has been converted to UTF-8 is generally viewed as UTF-8.
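The wrong conversion itself can be reproduced in a few lines of Java. The following is a minimal sketch, assuming the original text is the two characters U+8001 U+9A6C stored in GB18030 and misread as Windows-1252 before being converted to UTF-8; the class and method names (GarbleDemo, toHex, wrongConvert) are hypothetical and exist only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbleDemo {
    // Render a byte array as space-separated uppercase hex, e.g. "C0 CF".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Decode the bytes with the wrong charset, then re-encode the
    // misread string as UTF-8. This is the conversion that causes garbling.
    static byte[] wrongConvert(byte[] original, Charset wrong) {
        String misread = new String(original, wrong);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed original text, written with Unicode escapes so the
        // example does not depend on the source file's own encoding.
        String text = "\u8001\u9A6C";
        byte[] original = text.getBytes(Charset.forName("GB18030"));
        System.out.println(toHex(original)); // C0 CF C2 ED

        byte[] garbled = wrongConvert(original, Charset.forName("windows-1252"));
        System.out.println(toHex(garbled)); // C3 80 C3 8F C3 82 C3 AD
    }
}
```

Each original byte above 0x7F becomes a two-byte UTF-8 sequence, which is why the garbled file is twice as long as the original.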

Recovering garbled text

Text becomes garbled mainly because of a wrong encoding conversion. Recovering it means recovering two key pieces of information: the original binary encoding, A, and the encoding it was wrongly interpreted as, B.

The basic idea is to try reversing the operation. Assume the wrong conversion used encoding B, and encode the garbled text with B to get back the binary as it was before the conversion. Then assume an encoding A, interpret that binary with it, and look at the result. Try various encodings in this way; if some combination produces normal-looking characters, the text can be recovered.

This may sound abstract, so let's work through an example. Assume the garbled form is "ÀÏÂí" and try various values of B and A to see what characters result. We start with an editor, taking UltraEdit as the example, and then do the same in Java.

Using UltraEdit

UltraEdit supports encoding conversion and switching the encoding used for viewing, and it also supports displaying and editing files in binary, so we use it as the example. Other editors may have similar functionality.

Create a new UTF-8 encoded file and copy "ÀÏÂí" into it. Convert it to Windows-1252 using the encoding conversion function: "File" -> "Convert to" -> "Western Europe" -> WIN-1252.
After the conversion, open the hexadecimal editor and look at the binary form.

The text still appears as ÀÏÂí, but the binary has become C0 CF C2 ED. This step is equivalent to assuming that B is Windows-1252. Now view the binary under various encodings: in UltraEdit, turn off hexadecimal editing and switch the viewing encoding to GB18030 via "View" -> "View Way (File encoding)" -> "East Asian language" -> GB18030. After switching, the same binary magically becomes the correct characters "老马". Open the hexadecimal editor again and the binary is still C0 CF C2 ED. Viewing as GB18030 is equivalent to assuming that A is GB18030.

In this example we happened to be right on the first try. In practice several attempts may be needed, and the process is the same each time: first convert the encoding (using B), then view the result with a different encoding (using A). If a combination produces text that looks right, the text has been recovered. For each major candidate for B, one can list the resulting binary and how it appears when interpreted under each candidate for A.

In this example the first combination tried is the correct one: the original encoding A is GB18030, and it was misinterpreted as B, Windows-1252.

Using Java

We have not yet introduced much of what is needed to use Java, but some readers already know Java well, so this article lists the relevant code. Beginners who do not follow it need not worry; we will explain it further later.

The class that handles strings in Java is String, and it has two members that we need:

public byte[] getBytes(String charsetName): this method returns the binary form of the string in the given encoding.
public String(byte[] bytes, String charsetName): this constructor interprets the given byte array bytes as a string in the encoding charsetName.
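A short sketch may make these two members more concrete. It assumes the sample text is the two characters U+8001 U+9A6C; the class name StringBytesDemo and the helper roundTrip are hypothetical. It shows that the same string has different binary forms in different encodings, and that decoding with the same encoding used for encoding gives the string back.

```java
import java.io.UnsupportedEncodingException;

public class StringBytesDemo {
    // Encode a string with the named charset and decode the bytes again
    // with the same charset.
    static String roundTrip(String text, String charsetName)
            throws UnsupportedEncodingException {
        byte[] bytes = text.getBytes(charsetName);
        return new String(bytes, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumed sample text, written with Unicode escapes.
        String text = "\u8001\u9A6C";

        // The same two characters occupy a different number of bytes
        // depending on the encoding.
        System.out.println(text.getBytes("GB18030").length); // 4
        System.out.println(text.getBytes("UTF-8").length);   // 6

        // Decoding with the matching charset restores the string.
        System.out.println(roundTrip(text, "GB18030").equals(text)); // true
    }
}
```

Both members throw the checked UnsupportedEncodingException when the charset name is unknown, which is why main declares it.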

Assuming A is GB18030 and B is Windows-1252, the Java code for recovery is as follows:

String str = "ÀÏÂí";
// Encode with B (Windows-1252), then decode with A (GB18030).
String newStr = new String(str.getBytes("windows-1252"), "GB18030");
System.out.println(newStr);

The string is first encoded with B (Windows-1252) to obtain its binary, then that binary is interpreted with A (GB18030) to obtain a new string, which prints as "老马".

Again, this happened to be right on the first try. In practice, we can write a loop that tests the result under different A/B combinations, as shown in the following code:

// Requires: import java.io.UnsupportedEncodingException;
public static void recover(String str) throws UnsupportedEncodingException {
    String[] charsets = new String[] {"windows-1252", "GB18030", "Big5", "UTF-8"};
    for (int i = 0; i < charsets.length; i++) {
        for (int j = 0; j < charsets.length; j++) {
            if (i != j) {
                // Encode with the assumed B, then decode with the assumed A.
                String s = new String(str.getBytes(charsets[i]), charsets[j]);
                System.out.println("---- Original encoding (A) assumed to be: " + charsets[j]
                        + ", misinterpreted as (B): " + charsets[i]);
                System.out.println(s);
                System.out.println();
            }
        }
    }
}

The code above tries each pair of different encodings in turn. If one of the outputs looks correct, the text can be recovered.
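When the results need to be inspected by a program rather than read from the console, the same loop can collect the candidates instead of printing them. The following is a sketch under that assumption; the class RecoverCandidates, the method candidates, and the "B -> A" key format are hypothetical choices, not part of the original code.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class RecoverCandidates {
    static final String[] CHARSETS = {"windows-1252", "GB18030", "Big5", "UTF-8"};

    // For every ordered pair of different charsets (B, A), encode the
    // garbled string with B and decode the bytes with A. The map key is
    // "B -> A" and the value is the candidate string.
    static Map<String, String> candidates(String garbled) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String b : CHARSETS) {
            for (String a : CHARSETS) {
                if (a.equals(b)) {
                    continue;
                }
                byte[] bytes = garbled.getBytes(Charset.forName(b));
                result.put(b + " -> " + a, new String(bytes, Charset.forName(a)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The garbled sample from this article, written with Unicode escapes.
        String garbled = "\u00C0\u00CF\u00C2\u00ED";
        for (Map.Entry<String, String> e : candidates(garbled).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Passing Charset objects rather than charset names avoids the checked UnsupportedEncodingException, at the cost of an unchecked exception if a name is not supported by the running JVM.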

Discussion of recovery

As can be seen, many attempts may be needed. The example above tries four common encodings, GB18030, Windows-1252, Big5, and UTF-8, for a total of 12 combinations. These four should be sufficient for most practical cases, but if other encodings are involved in your situation, add them to the list.

Not all garbled text can be recovered. If the text contains many unrecognized characters, such as "?", it is difficult to recover. It is also difficult to recover text that became garbled through several successive wrong interpretations and conversions.
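One way to see why unrecognized characters block recovery is to check whether a string survives being encoded and decoded with a candidate B. The sketch below is only a rough heuristic, with hypothetical names (LossCheck, hasReplacementChar, survives): a string that already contains the Unicode replacement character U+FFFD, or that does not round-trip through B, has lost information, so encoding it with B cannot reproduce the original bytes.

```java
import java.nio.charset.Charset;

public class LossCheck {
    // True if the string contains U+FFFD, the character decoders
    // substitute for bytes they cannot interpret.
    static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    // True if encoding the string with the charset and decoding it again
    // gives back the same string, i.e. nothing was replaced along the way.
    static boolean survives(String s, Charset cs) {
        return new String(s.getBytes(cs), cs).equals(s);
    }

    public static void main(String[] args) {
        Charset win1252 = Charset.forName("windows-1252");

        // The garbled sample from this article fits in Windows-1252.
        System.out.println(survives("\u00C0\u00CF\u00C2\u00ED", win1252)); // true

        // Chinese characters do not: getBytes substitutes '?' for each one.
        System.out.println(survives("\u8001\u9A6C", win1252)); // false

        System.out.println(hasReplacementChar("\u00C0\uFFFD")); // true
    }
}
```

A failed check does not prove that recovery is impossible under some other B, only that this particular candidate cannot be reversed without loss.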

Summary

The previous section and this one covered encodings, the causes of garbled text, and methods of recovery. All of this is independent of any programming language.

Next, it is time to look at how characters are represented and handled in Java. We know that Java uses the char type to represent a character, but in the third section we raised the question "how can a character type take part in arithmetic and comparison?".

To answer it, we need a clearer and deeper understanding of the character type in Java.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
