Encoding and decoding

Last Update:2018-12-05 Source: Internet

Author: User


A computer can only process binary numbers such as 011001, while characters are the symbols we use in daily life. For the computer to store, transmit, and display characters, each character must be converted to a binary code such as 0110000. This conversion is called encoding. The reverse process, converting a binary code such as 011000 back to a character, is called decoding. In Java, char represents a character and String represents a string.

Which character maps to which binary code is determined by national bodies (national standards) and international organizations (international standards).

Generally, binary strings are not used to represent the encoding of a character (because it is difficult to write and read). Therefore, hexadecimal strings are generally used to represent the encoding of a character.
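To make the point about notation concrete, here is a minimal sketch that prints the numeric code of a character first in binary and then in hexadecimal. The class name CharCodeDemo and its two helper methods are hypothetical names chosen for this illustration; the code relies only on Integer.toBinaryString and Integer.toHexString from the standard library.

```java
class CharCodeDemo {
    // Binary digits of a char's numeric code, without leading zeros.
    static String toBinary(char c) {
        return Integer.toBinaryString(c);
    }

    // The same code written in upper-case hexadecimal.
    static String toHex(char c) {
        return Integer.toHexString(c).toUpperCase();
    }

    public static void main(String[] args) {
        char latin = 'A';
        char han = '\u4e2d'; // a Chinese character, written as a Unicode escape
        System.out.println(toBinary(latin) + " = 0x" + toHex(latin)); // 1000001 = 0x41
        System.out.println(toBinary(han) + " = 0x" + toHex(han));     // 100111000101101 = 0x4E2D
    }
}
```

The hexadecimal form is a quarter of the length of the binary form, which is why it is the usual way to write an encoding down.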

Therefore:
A set of character <--> hexadecimal code mappings (a set, because there are many characters) constitutes a character set.
Character sets you will often see are Unicode, UTF-8, GB2312, GB18030, GBK, Big5, and ISO-8859-1 (also known as Latin-1):
Unicode and UTF-8 support the characters of all the world's languages.
GB2312 is the old national standard and does not support traditional Chinese characters.
GBK is not a national standard, but it supports traditional Chinese characters.
GB18030 is the new national standard and also supports traditional Chinese characters.
Big5 supports traditional Chinese characters.
ISO-8859-1 does not support Chinese characters; it is commonly used in the United States and other countries that write in the Latin alphabet.
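One way to check claims like these on your own machine is to ask each character set whether it can encode a given character. The sketch below is only an illustration: CharsetSupportDemo and canEncode are hypothetical names, and it assumes the listed character sets may or may not be installed in your JVM, which is why it checks Charset.isSupported first. No expected output is shown because the result depends on the JVM.

```java
import java.nio.charset.Charset;

class CharsetSupportDemo {
    // True if the named charset is installed and can encode the character.
    static boolean canEncode(String charsetName, char c) {
        if (!Charset.isSupported(charsetName)) {
            return false; // this JVM does not ship the charset
        }
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char han = '\u4e2d'; // a Chinese character, written as a Unicode escape
        String[] names = {"UTF-8", "GB2312", "GBK", "GB18030", "Big5", "ISO-8859-1"};
        for (String name : names) {
            System.out.println(name + ": " + canEncode(name, han));
        }
    }
}
```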
Example:
public class EncodingTest01 {
    public static void main(String[] args) throws Exception {
        String s = "中国"; // the two Chinese characters for "China"
        byte[] bs = s.getBytes("gb18030");
        System.out.println(Utils.byteArrayToHex(bs));
    }
}

public class Utils {
    private static char[] hexDigits = {'0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};

    // Renders each byte as two hex digits followed by a comma.
    public static String byteArrayToHex(byte[] bs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bs) {
            sb.append(hexDigits[(b >> 4) & 0x0f]); // high four bits
            sb.append(hexDigits[b & 0x0f]);        // low four bits
            sb.append(",");
        }
        return sb.toString();
    }
}

Running the program above on the two simplified Chinese characters "中国" ("China") with different character sets produces the following encodings:
Unicode: FE, FF, 4E, 2D, 56, FD
UTF-8: E4, B8, AD, E5, 9B, BD
GB18030: D6, D0, B9, FA
ISO-8859-1: 3F, 3F
Explanation:
About Unicode
When "中国" is encoded as Unicode, the result is preceded by two additional bytes: FE and FF. These two bytes are called the BOM (byte order mark). The Unicode code of the character "中" is 4E2D, which takes two bytes: 4E and 2D. During transmission, does 4E come first or 2D? The BOM decides. If the BOM is FE FF (big endian), 4E comes first; if the BOM is FF FE (little endian), 2D comes first.
That is to say, if your file encoding is Unicode (that is, UTF-16), the file starts with either FE FF (the bytes that follow are in the same order as UTF-16BE) or FF FE (the bytes that follow are in the same order as UTF-16LE).

In Java, "Unicode" is an alias for "UTF-16", and both produce the same byte order as UTF-16BE. The difference is that encoding with UTF-16 writes the FE FF BOM first, while encoding with UTF-16BE writes no BOM.
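The effect of the byte order mark can be seen by encoding a single character with the three UTF-16 variants that every Java platform is required to provide. This is a small sketch; the class BomDemo and its hex helper are hypothetical names introduced for the illustration.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Formats bytes as space-separated, two-digit upper-case hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String han = "\u4e2d"; // one Chinese character, code 4E2D
        System.out.println("UTF-16:   " + hex(han.getBytes(StandardCharsets.UTF_16)));   // FE FF 4E 2D
        System.out.println("UTF-16BE: " + hex(han.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
        System.out.println("UTF-16LE: " + hex(han.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
    }
}
```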
About UTF-8
A UTF-8 encoded file may also begin with a marker: EF BB BF.
Files in the other encodings have no such marker at the start!
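When reading such a file in Java, the marker usually has to be skipped by hand, because Java's built-in UTF-8 decoder is generally reported to keep it as an extra leading character (U+FEFF). A minimal sketch, assuming the whole file has already been read into a byte array; Utf8BomDemo and decodeUtf8 are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

class Utf8BomDemo {
    // Decodes UTF-8 bytes, skipping a leading EF BB BF marker if present.
    static String decodeUtf8(byte[] bytes) {
        int offset = 0;
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            offset = 3;
        }
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withMarker = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};
        byte[] withoutMarker = {0x41};
        System.out.println(decodeUtf8(withMarker));    // A
        System.out.println(decodeUtf8(withoutMarker)); // A
    }
}
```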

Encoding "中国" ("China") with ISO-8859-1 yields 3F 3F, because the ISO-8859-1 character set does not contain these two characters at all! Each of them is therefore replaced by 3F, the code of the question mark "?".
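The silent replacement by 3F is the default behaviour of getBytes. If you would rather be told that a character cannot be represented, the java.nio.charset API lets you request an error instead. The following is a sketch of that approach, not something the original program does; StrictEncodeDemo and encodeStrict are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncodeDemo {
    // Encodes text and throws instead of silently substituting '?'.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        ByteBuffer buf = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u56fd"; // the two characters from the example above
        // getBytes replaces each unmappable character with 3F.
        System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 2
        try {
            encodeStrict(text, StandardCharsets.ISO_8859_1);
        } catch (CharacterCodingException e) {
            System.out.println("cannot be encoded: " + e);
        }
    }
}
```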
Decoding
Decoding converts a binary stream (that is, a byte stream or byte array) into characters. You must decode a byte array with the same character set that was used to encode it!
public class EncodingTest02 {
    public static void main(String[] args) throws Exception {
        // The GB18030 bytes of "中国"
        byte[] bs = {(byte) 0xd6, (byte) 0xd0, (byte) 0xb9, (byte) 0xfa};
        String s = new String(bs, "gb18030");
        System.out.println(s);
    }
}

As the program above shows, D6D0 B9FA is the GB18030 encoding of the two characters "中国" ("China"), so decoding the byte array with that same character set yields the original characters!
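What happens when the character set does not match can be made visible with a strict decoder, which reports bytes that are not valid in the chosen character set instead of producing garbled characters. This is a sketch under the assumption that an exception is the behaviour you want; StrictDecodeDemo and decodeStrict are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Decodes bytes and throws if they are not valid in the given charset.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // The GB18030 bytes from the example above, wrongly treated as UTF-8.
        byte[] gbBytes = {(byte) 0xd6, (byte) 0xd0, (byte) 0xb9, (byte) 0xfa};
        try {
            System.out.println(decodeStrict(gbBytes, StandardCharsets.UTF_8));
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```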
Characters in Java
The first thing to understand is that Java stores characters in memory as binary codes (something like 0110010 is what actually sits in memory)! So which encoding does Java use when it stores characters in memory? The answer is Unicode (more precisely, UTF-16).
Therefore, the two Chinese characters "中国" ("China") exist in memory as 4E, 2D, 56, FD!
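The in-memory form can be inspected directly, because each char of a Java String is one 16-bit UTF-16 code unit. A minimal sketch, with InMemoryDemo and codeUnits as hypothetical names:

```java
class InMemoryDemo {
    // Lists the UTF-16 code units of a string as four-digit hex values.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u56fd"; // the two characters, written as escapes
        System.out.println(text.length());   // 2
        System.out.println(codeUnits(text)); // 4E2D 56FD
    }
}
```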

But please note that string constants in a Java class file are stored in UTF-8 (a slightly modified form of it)! When the Java Virtual Machine loads a class file, it reads them as UTF-8 and converts them to UTF-16 (Unicode) in memory.

Therefore, what new String(byte[], "gb18030") really means is: convert a byte stream (encoded in GB18030) into Unicode and store the result in memory!
[Figure: the string "中国" ("China"), held in memory as 4E, 2D, 56, FD, converted by getBytes]
getBytes("UTF-8") -> E4, B8, AD, E5, 9B, BD
getBytes("gb18030") -> D6, D0, B9, FA
getBytes("ISO-8859-1") -> 3F, 3F

As shown above, given the two Chinese characters "中国" ("China"), getBytes("encoding") converts them to different encodings!
Now look at the reverse direction. Suppose we have a byte stream and want to convert it into a string:

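[Figure: the byte stream E4, B8, AD, E5, 9B, BD converted by new String]
new String(byte[], "UTF-8") -> "中国" ("China"), held in memory as 4E, 2D, 56, FD
new String(byte[], "ISO-8859-1") -> a garbled string; calling getBytes("ISO-8859-1") on it returns E4, B8, AD, E5, 9B, BD again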
// The UTF-8 encoding of the two characters "中国": E4, B8, AD, E5, 9B, BD
byte[] bs = {(byte) 0xe4, (byte) 0xb8, (byte) 0xad, (byte) 0xe5, (byte) 0x9b, (byte) 0xbd};
String s = new String(bs, "ISO-8859-1"); // the length is 6
System.out.println(s); // garbled characters
byte[] newBs = s.getBytes("ISO-8859-1");
String newString = new String(newBs, "UTF-8");
System.out.println(newString); // correct
If a byte stream that was originally encoded in UTF-8 is decoded as ISO-8859-1, the result is a string of length 6, because ISO-8859-1 maps every byte to exactly one character. Printing that string shows garbled characters! You can then reverse the conversion: encoding the string with ISO-8859-1 returns the original byte stream intact, and that byte stream can then be decoded correctly with UTF-8!
Summary
If you get a byte stream, you need to know exactly which character set was used to encode it! There is an open-source tool, chardet, that can be used to guess the encoding of a byte stream.

If you get a string and find that it is garbled on output, can you fix the garbling by some technical means? The answer is: not necessarily! If you have a string, you can be sure that someone produced it by decoding a byte stream with some character set!
A) If that someone decoded the byte stream with the ISO-8859-1 character set, you are in luck: you can turn the garbled string in your hands back into the correct string! The technique is: first call getBytes("ISO-8859-1") to recover the byte stream, then convert the byte stream into the correct string with new String(byte[], "correct character set")!
B) If that someone did not decode the byte stream with the ISO-8859-1 character set, then unfortunately you most likely have no way to restore it correctly!
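The difference between cases A and B comes down to whether the wrong decoding step lost information. The sketch below compares the two using the GB18030 bytes D6, D0, B9, FA from earlier, once wrongly decoded as ISO-8859-1 and once wrongly decoded as UTF-8. RoundTripDemo and survivesRoundTrip are hypothetical names, and UTF-8 merely stands in here for "some character set other than ISO-8859-1".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // True if decoding with the charset and encoding again returns the same bytes.
    static boolean survivesRoundTrip(byte[] original, Charset wrongCharset) {
        String decoded = new String(original, wrongCharset);
        return Arrays.equals(original, decoded.getBytes(wrongCharset));
    }

    public static void main(String[] args) {
        byte[] gbBytes = {(byte) 0xd6, (byte) 0xd0, (byte) 0xb9, (byte) 0xfa};
        // Case A: ISO-8859-1 maps every byte to one character, so nothing is lost.
        System.out.println(survivesRoundTrip(gbBytes, StandardCharsets.ISO_8859_1)); // true
        // Case B: invalid UTF-8 sequences become replacement characters, so bytes are lost.
        System.out.println(survivesRoundTrip(gbBytes, StandardCharsets.UTF_8));      // false
    }
}
```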
Page encoding
Requests submitted to the server with the GET method
In a request submitted to the server, characters also have to be encoded for transmission! The browser automatically encodes the Chinese characters in a URL before sending it. If you type a URL containing Chinese characters into the browser's address bar, the browser encodes them with the default character set of the current operating system (GB2312 on a Chinese-language operating system) and then sends them to the server. If you click a link on a page (with Chinese characters in the link), the browser encodes them with the character set used by that page and then sends them to the server!

For example, if the page encoding is GB18030 (or GBK or GB2312), the two Chinese characters "中国" ("China") are sent to the server as %D6%D0%B9%FA.

For example: http://localhost:8080/myservlet/adduser?username=%D6%D0%B9%FA
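The percent-encoded form can be reproduced in Java with java.net.URLEncoder, which performs the same kind of encoding a browser applies to form and query values. A small sketch; UrlEncodeDemo and encode are hypothetical names, and the GB18030 line is guarded because that character set is not guaranteed to be installed in every JVM.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;

class UrlEncodeDemo {
    // Percent-encodes a value using the named character set.
    static String encode(String value, String charsetName) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u4e2d\u56fd"; // the two characters, written as escapes
        System.out.println(encode(text, "UTF-8")); // %E4%B8%AD%E5%9B%BD
        if (Charset.isSupported("GB18030")) {
            System.out.println(encode(text, "GB18030")); // %D6%D0%B9%FA
        }
    }
}
```

The same two characters produce different query strings depending on the page's character set, which is why the server has to know which one was used.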
On the server we call request.getParameter("username") to get a string, and the result is garbled. We can therefore be sure that someone has already decoded the byte stream sent by the client into a string, using some character set! That someone is the application server (Tomcat), and the character set is the application server's default, ISO-8859-1. So we are in luck: the string can be rescued!
String username = request.getParameter("username");
// Convert it to the correct string!
byte[] bs = username.getBytes("ISO-8859-1");
username = new String(bs, "gb18030"); // correct

System.out.println(username);
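The same repair can be tried without a servlet container. The sketch below uses URLDecoder with ISO-8859-1 to stand in for what the application server does by default, then applies the getBytes / new String repair. ParameterFixDemo, containerDecode, and fix are hypothetical names, and the example uses a UTF-8 page rather than GB18030 only because UTF-8 is guaranteed to be available in every JVM.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class ParameterFixDemo {
    // Stand-in for a container that decodes query values as ISO-8859-1.
    static String containerDecode(String rawValue) throws UnsupportedEncodingException {
        return URLDecoder.decode(rawValue, "ISO-8859-1");
    }

    // Recovers the bytes and decodes them with the page's real character set.
    static String fix(String garbled, String pageCharset) throws UnsupportedEncodingException {
        return new String(garbled.getBytes("ISO-8859-1"), pageCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String raw = "%E4%B8%AD%E5%9B%BD"; // sent by a UTF-8 page
        String garbled = containerDecode(raw);
        System.out.println(garbled.length()); // 6
        String repaired = fix(garbled, "UTF-8");
        System.out.println(repaired.length()); // 2
    }
}
```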

Of course, to avoid doing this every time, we can simply change the character set that Tomcat uses by default to decode URLs!
In the server.xml file, find the following element and add the URIEncoding="gb18030" attribute:
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" URIEncoding="gb18030"/>
Requests submitted to the server with the POST method
The browser encodes the data with the character set used by the page that contains the form and then sends it!

The URIEncoding configured in Tomcat applies only to data in requests submitted with GET. For POST requests, you must call request.setCharacterEncoding("gb18030") before the first call to request.getParameter("XXX")!
Response encoding

Call response.setCharacterEncoding("gb18030") before calling response.getWriter().

By default the response is encoded in ISO-8859-1, so Chinese characters come out garbled. Setting gb18030 avoids the garbled characters.
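Why the default produces garbled output can be shown without a servlet container: a writer bound to ISO-8859-1 cannot represent Chinese characters and writes 3F for each one. The sketch below uses an OutputStreamWriter over an in-memory stream as a stand-in for the writer a response would hand out; WriterEncodingDemo and write are hypothetical names, and UTF-8 is used for the working case only because it is guaranteed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    // Returns the bytes that a writer using the given charset emits for the text.
    static byte[] write(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4e2d\u56fd"; // the two characters, written as escapes
        System.out.println(write(text, StandardCharsets.ISO_8859_1).length); // 2 (both bytes are 3F)
        System.out.println(write(text, StandardCharsets.UTF_8).length);      // 6
    }
}
```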

Http://wenku.baidu.com/view/14a91366caaedd3383c4d331.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
