Definition of encoding and decoding:
Computers can only handle binary digits such as 0100110, whereas characters are the symbols we use in daily life. For a computer to store, transmit, and display characters, each character has to be converted into a binary code such as 0100101. This conversion is encoding.
Conversely, the process of converting such binary codes back into characters is decoding.
Which character maps to which binary string is determined by national standards bodies, international standards organizations, and so on.
Because binary strings are hard to read and write, the encoding of a character is usually written as a hexadecimal string instead. The set of mappings between characters <--> hexadecimal strings is what we call a character set.
The character sets we often see are: Unicode, UTF-8, GBK, GB2312, GB18030, Big5, ISO-8859-1 (also called Latin-1)
Unicode and UTF-8 (strictly speaking, UTF-8 is one encoding of the Unicode character set) can represent the characters of all languages in the world at present
GB2312 is the old national standard and does not support traditional Chinese characters
GBK is not a national standard, but supports traditional Chinese characters
GB18030 is the new national standard and also supports traditional Chinese characters
Big5 supports traditional Chinese characters
ISO-8859-1 does not support Chinese characters; it is commonly used for English and other Western European languages written in the Latin alphabet
The two characters of "China" (中国), encoded in different ways, give different results:
Unicode: FE, FF, 4E, 2D, 56, FD
UTF-8: E4, B8, AD, E5, 9B, BD
GB18030: D6, D0, B9, FA
ISO-8859-1: 3F, 3F
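The byte values listed above can be reproduced with a short Java sketch. The class and method names here (EncodeChinaDemo, hex, encode) are illustrative only, and the example assumes a JDK in which the GB18030 charset is available, as it is in a standard full JDK. Java's "UTF-16" charset plays the role of "Unicode" in the list.

```java
import java.nio.charset.Charset;

class EncodeChinaDemo {
    // Formats bytes as space-separated upper-case hex, e.g. "E4 B8 AD".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Encodes the text with the named charset and returns the bytes as hex.
    static String encode(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        String china = "\u4e2d\u56fd";
        for (String name : new String[] {"UTF-16", "UTF-8", "GB18030", "ISO-8859-1"}) {
            System.out.println(name + ": " + encode(china, name));
        }
    }
}
```

Writing the characters as \u escapes is a deliberate choice: the result then does not depend on the encoding in which the source file itself was saved.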
About Unicode:
"China" encoded in Unicode is preceded by two additional bytes, FE FF, which are called the BOM (byte order mark). The character 中 has the Unicode encoding 4E, 2D; during transmission, should 4E come first, or 2D? The BOM decides this: if the BOM is FE FF (called big endian), 4E comes first; if the BOM is FF FE (called little endian), 2D comes first.
That is, if your file encoding is Unicode (or UTF-16), the file begins with the bytes FE FF (big-endian byte order, as in UTF-16BE) or FF FE (little-endian byte order, as in UTF-16LE). In Java, the charset names "Unicode" and "UTF-16" refer to the same encoding, which writes an FE FF BOM followed by big-endian bytes; "UTF-16BE" produces the same big-endian bytes but without a BOM.
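To see the byte order mark in practice, the following sketch encodes the single character 中 (U+4E2D) with Java's three UTF-16 charsets and then decodes a little-endian byte sequence that starts with a BOM. The class name Utf16VariantsDemo and its helper methods are hypothetical names chosen for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16VariantsDemo {
    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    static String encode(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d";
        System.out.println(encode(zhong, StandardCharsets.UTF_16));   // FE FF 4E 2D (BOM + big endian)
        System.out.println(encode(zhong, StandardCharsets.UTF_16BE)); // 4E 2D (no BOM)
        System.out.println(encode(zhong, StandardCharsets.UTF_16LE)); // 2D 4E (no BOM)

        // When decoding with "UTF-16", a leading BOM selects the byte order.
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x2D, 0x4E};
        String decoded = new String(littleEndianWithBom, StandardCharsets.UTF_16);
        System.out.println(decoded.equals(zhong)); // true
    }
}
```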
About UTF-8:
A UTF-8 encoded file may also begin with a marker: EF BB BF. This marker is optional, and many UTF-8 files omit it.
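The markers described so far (FE FF, FF FE, and EF BB BF) can be checked by looking at the first bytes of a file. The following is a minimal sketch, not a complete detector: BomSniffer and detect are hypothetical names, and the method ignores other encodings that also use a BOM (such as UTF-32).

```java
class BomSniffer {
    // Returns the encoding implied by a leading BOM, or null if none is found.
    // Only the three markers discussed in the text are recognized.
    static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no marker: the encoding must be known or guessed another way
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41})); // UTF-8
        System.out.println(detect(new byte[] {(byte) 0xFE, (byte) 0xFF, 0x4E, 0x2D}));        // UTF-16BE
        System.out.println(detect(new byte[] {(byte) 0xD6, (byte) 0xD0}));                    // null
    }
}
```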
Other encodings:
Files in other encodings have no marker at the start. ISO-8859-1 encodes "China" as 3F, 3F, two identical bytes, because its character set does not contain these two characters, so each one is replaced with 3F (the question mark).
Encoding and decoding problems in Java:
Java stores characters in memory (as binary digits such as 01101110, like everything else). The encoding Java uses in memory is Unicode, specifically UTF-16. String constants in a Java class file, however, are stored in a (modified) UTF-8 form. When the Java virtual machine loads a class file, it decodes those strings from UTF-8 and holds them in memory as UTF-16. Therefore, the statement new String(byte[], "GB18030") converts a byte stream encoded in GB18030 into a Unicode (UTF-16) string in memory.
If you want to convert a stream of bytes into characters, you need to know how the byte stream was encoded and then decode it with that encoding. The open source tool chardet can be used to guess which encoding a sequence of bytes uses.
If we get a string and find that it prints as garbled text, can the problem be fixed by technical means? The answer is: not necessarily.
The reason is that the string must have been produced by decoding a byte stream with some character set. So:
If the string was obtained by decoding the byte stream as ISO-8859-1, congratulations: the garbled text can be converted to the correct string. Call getBytes("ISO-8859-1") to get the original byte stream back, then call new String(byte[], "correct character set") to get the correct string. This works because ISO-8859-1 maps every possible byte value to a character, so no information is lost.
If it was decoded with some other character set, it is likely that it cannot be converted back to the correct string, because bytes that are invalid in that character set are replaced during decoding and the original values are lost.
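As a small illustration of the new String(byte[], "GB18030") statement, the sketch below decodes the four GB18030 bytes of "China" into a Java string and inspects its UTF-16 code units. DecodeDemo and decode are hypothetical names, and the example assumes the GB18030 charset is available in the running JDK. It uses the Charset overload of the constructor, which behaves like the charset-name overload but does not throw a checked exception.

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // Decodes a byte stream with the named charset into an in-memory (UTF-16) String.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] gb18030Bytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xB9, (byte) 0xFA};
        String china = decode(gb18030Bytes, "GB18030");

        // Four input bytes become two UTF-16 code units in memory.
        System.out.println(china.length());                       // 2
        System.out.println(Integer.toHexString(china.charAt(0))); // 4e2d
        System.out.println(Integer.toHexString(china.charAt(1))); // 56fd
    }
}
```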
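The two cases can be sketched as follows. MojibakeRepair and repair are hypothetical names for this illustration; the first case garbles UTF-8 bytes by decoding them as ISO-8859-1 and then recovers them, while the second decodes GB18030 bytes as UTF-8, which loses information (the example assumes the GB18030 charset is available).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Re-encodes the garbled string with the charset that was wrongly used to
    // decode it, then decodes the resulting bytes with the correct charset.
    static String repair(String garbled, Charset wronglyUsed, Charset correct) {
        return new String(garbled.getBytes(wronglyUsed), correct);
    }

    public static void main(String[] args) {
        String china = "\u4e2d\u56fd";
        Charset gb18030 = Charset.forName("GB18030");

        // Case 1: UTF-8 bytes wrongly decoded as ISO-8859-1. Every byte maps to
        // exactly one character, so the original bytes can be recovered.
        String garbled = new String(china.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        String fixed = repair(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(fixed.equals(china)); // true

        // Case 2: GB18030 bytes wrongly decoded as UTF-8. Invalid byte sequences
        // are replaced during decoding, so the round trip does not recover the text.
        String lossy = new String(china.getBytes(gb18030), StandardCharsets.UTF_8);
        String notFixed = repair(lossy, StandardCharsets.UTF_8, gb18030);
        System.out.println(notFixed.equals(china)); // false
    }
}
```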
Submitting a request to the backend through the GET method:
In requests submitted to the server, characters are also encoded before they are transmitted. The browser automatically encodes the Chinese characters in the URL and then sends them. If you type Chinese characters directly into the address bar, the browser encodes them according to the operating system's default character set (GB2312 on a Chinese operating system) and sends them to the server; if you click a link containing Chinese characters on a page, the browser encodes the characters using the character set of that page and sends them to the server.
To avoid these encoding problems, we can set the default character set that Tomcat uses to decode URLs. Modify the following element in the server.xml file, adding the attribute URIEncoding="GB18030":
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" URIEncoding="GB18030"/>
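To illustrate why the decoding character set on the server has to match the one the browser used, the following sketch percent-encodes the character 中 with two character sets and then decodes the result with a matching and a mismatching one. It uses java.net.URLEncoder and java.net.URLDecoder with their Charset overloads (available since Java 10) as a stand-in for what the browser and Tomcat do; UrlEncodingDemo is a hypothetical class name, and the GB18030 charset is assumed to be available.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UrlEncodingDemo {
    public static void main(String[] args) {
        String zhong = "\u4e2d";
        Charset gb18030 = Charset.forName("GB18030");

        // The same character produces different percent-encoded bytes
        // depending on the character set used for encoding.
        System.out.println(URLEncoder.encode(zhong, StandardCharsets.UTF_8)); // %E4%B8%AD
        System.out.println(URLEncoder.encode(zhong, gb18030));                // %D6%D0

        // Decoding with the matching character set restores the character.
        System.out.println(URLDecoder.decode("%D6%D0", gb18030).equals(zhong)); // true

        // Decoding with a different character set yields garbled text.
        System.out.println(URLDecoder.decode("%D6%D0", StandardCharsets.UTF_8).equals(zhong)); // false
    }
}
```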
Submitting a request to the backend through the POST method:
The browser encodes the data according to the character set of the page that contains the form. Configuring URIEncoding in Tomcat only affects GET requests; for POST requests, request.setCharacterEncoding("GB18030") must be called before calling request.getParameter().
Response encoding:
Before calling response.getWriter(), call response.setCharacterEncoding("GB18030") or response.setContentType("text/html; charset=GB18030").
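Why the character set has to be chosen before the writer is obtained can be illustrated with plain java.io classes. The sketch below is only an analogy under the assumption that the writer returned by response.getWriter() behaves like a PrintWriter wrapped around the response's output stream: the character set is fixed when the writer is created and determines the bytes that are sent. ResponseWriterSketch and render are hypothetical names, not part of the Servlet API.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ResponseWriterSketch {
    // Writes the body through a writer bound to the given charset and
    // returns the bytes that would be sent to the client.
    static byte[] render(String body, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The charset is fixed here, when the writer is created.
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, charset));
        writer.print(body);
        writer.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String china = "\u4e2d\u56fd";
        System.out.println(render(china, Charset.forName("GB18030")).length); // 4
        System.out.println(render(china, StandardCharsets.UTF_8).length);     // 6
        System.out.println(render(china, StandardCharsets.ISO_8859_1).length); // 2 (both become 3F)
    }
}
```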