What is a character and what is a byte?
A computer has no concept of characters, only bytes. Characters belong to the layer of human language and are used for communication between people; bytes are not readable to humans, yet computers store all data as bytes.
So to store a character from the human mind in a computer, the character must be converted to byte data, and that conversion needs a mapping rule. This mapping rule is what is usually meant by a character encoding. For example, saying that a file is GBK-encoded means: the character data in this file was converted to bytes for storage according to the GBK character-to-byte mapping rule.
So whenever characters from the human mind are stored in a computer, or passed through one, a conversion under some character-to-byte mapping rule is involved.
Converting characters to bytes according to the mapping rules is called encoding, and vice versa is called decoding.
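A minimal sketch of encoding and decoding in Java. The class name `EncodeDecodeDemo` and the helper `hex` are made up for illustration, and the sketch assumes the runtime includes the GBK charset (standard JDK builds do). The character is written as a Unicode escape so that the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Hypothetical helper: render bytes as upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available
        String text = "\u4eba"; // one CJK character, U+4EBA

        // Encoding: characters -> bytes, one result per mapping rule.
        byte[] gbkBytes = text.getBytes(gbk);
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("GBK   bytes: " + hex(gbkBytes));
        System.out.println("UTF-8 bytes: " + hex(utf8Bytes));

        // Decoding: bytes -> characters, using the rule that encoded them.
        String decoded = new String(gbkBytes, gbk);
        System.out.println("round trip ok: " + decoded.equals(text));
    }
}
```

The same character becomes different bytes under different rules, which is why the rule has to travel with the bytes or be agreed on in advance.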
Now that we know what characters and bytes are, and why encoding and decoding are needed, let's see where encoding comes up in Java:
1: Java source files. When we type characters in an editor and save them, a mapping rule must be chosen to store them in the computer. When the Java source is compiled, javac reads the source file, gets its byte data, and must convert the bytes to characters. The mapping rule it uses must be the same one chosen when the source was saved; only then are the bytes correctly restored to the characters the author had in mind. javac then writes these characters into the .class file as bytes (string constants there are stored in a modified form of UTF-8, while at run time Java strings are held as UTF-16). So when compiling Java source you must specify the source's character-to-byte mapping rule (its encoding, via javac's -encoding option); if you specify the wrong one, the resulting characters will be wrong.
Note: by default javac reads source files using the encoding of the platform it is compiling on. So if some developers on the team use a Japanese system and others a Chinese system, the source encoding is never unified, the files are uploaded to CVS, and the code is later compiled on a UTF-8 system, hehe, everything is messed up.
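What goes wrong when a compiler decodes source bytes with the wrong rule can be imitated in a few lines. This is only a sketch of the idea, not how javac is implemented; `SourceEncodingDemo` and `firstLine` are hypothetical names, and the GBK charset is assumed to be available.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceEncodingDemo {
    // Decode the first line of some file content with the given charset,
    // much as a compiler must decode source bytes before parsing them.
    static String firstLine(byte[] fileBytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two CJK characters, U+4E2D U+6587

        // The "editor" saves the file as UTF-8.
        byte[] saved = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the rule used when saving restores the characters.
        String right = firstLine(saved, StandardCharsets.UTF_8);
        // Decoding with a different rule (GBK here) yields other characters.
        String wrong = firstLine(saved, Charset.forName("GBK"));

        System.out.println("same rule restores text:      " + right.equals(original));
        System.out.println("different rule restores text: " + wrong.equals(original));
    }
}
```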
2: Console encoding: When we call System.out.println() in Java code to output characters, we hand the data to the operating system, which then displays it. An encoding is involved in between here as well. Remember: wherever characters are involved, encoding is involved. Because everything underneath is passed as bytes, a character-to-byte mapping rule must be chosen.
Imagine that on the Java side we use one mapping to turn characters into bytes and send these bytes to the operating system; the operating system receives the byte data and then uses some mapping to turn the bytes back into characters. If the two mappings are not the same, the result is unpredictable data rather than what the user hoped for; this is garbled text.
The Java implementation uses the local operating system's default mapping to turn characters into bytes, and the operating system uses the same default to map the bytes back to characters, so this step does not go wrong. When we see garbled text in the console, therefore, it is because the character in Java memory was already not the one you wanted to see.
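To make the character-to-byte step behind console output visible, the sketch below prints through a `PrintStream` with an explicitly named encoding and captures the bytes instead of sending them to a real console. `ConsoleEncodingDemo` and `bytesPrinted` are illustrative names, and the GBK charset is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ConsoleEncodingDemo {
    // Print text through a PrintStream that uses the named encoding and
    // return the bytes that would be handed on to the operating system.
    static byte[] bytesPrinted(String text, String encodingName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, encodingName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown encoding: " + encodingName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4eba"; // U+4EBA
        System.out.println("GBK stream writes   " + bytesPrinted(text, "GBK").length + " bytes");
        System.out.println("UTF-8 stream writes " + bytesPrinted(text, "UTF-8").length + " bytes");
    }
}
```

A console that expects one of these byte sequences but receives the other will show garbled text even though the character in memory is correct.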
As noted for source files above, if the wrong encoding is specified at compile time, the source bytes are mapped to the wrong characters; those wrong characters are then encoded into the .class file, and when the program runs they are decoded back into the same wrong characters (for example, bytes that were written under one mapping rule but were decoded by the compiler under the GBK rule). A more confusing case of console garbling is remote access over SSH, because there is an extra layer: the SSH server maps the characters to bytes under some mapping rule to send them to the SSH client, and the SSH client chooses a mapping rule to map the bytes back to characters.
3: Encoding of JSP files. Use pageEncoding in a JSP to specify the encoding of the JSP source file. The principle is the same as for Java source files in item 1 above. But why do JSP files need this when .java files do not? Because a JSP is compiled on the server. You write the JSP on your local machine, where it is stored in the default encoding, say GBK, and then deploy it to a server whose default is UTF-8; if it were handled the same way as .java files, wouldn't it go wrong?
4: Data the web server sends to the browser: because character data is passed over the network, the characters must be mapped to bytes in some way, and at the browser end the browser receives the byte data and chooses some mapping to turn it back into characters for the user to read. Again, an error occurs if the chosen mappings do not match. So the server can tell the client which encoding it used. This information, such as Content-Type: text/html;charset=GBK, can be carried in the HTTP message headers, and is set through response.setContentType (in a JSP, the <%@ page contentType="" %> directive is translated into a response.setContentType call). If pageEncoding is specified in the JSP but contentType is not, the generated servlet code sets the content type using the pageEncoding value. If no content type is set at all, the server (Tomcat) defaults to ISO-8859-1.
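The handshake described here, where the server announces its rule and the browser follows it, can be sketched in plain Java. This is not the servlet API; `ContentTypeDemo`, `contentType`, and `charsetOf` are hypothetical helpers, the fallback to ISO-8859-1 when no charset is declared is a simplifying assumption, and GBK is assumed to be available.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class ContentTypeDemo {
    // Server side: build the header value that announces the body's encoding.
    static String contentType(Charset charset) {
        return "text/html;charset=" + charset.name();
    }

    // Receiving side: pick the charset out of a Content-Type value.
    // Falls back to ISO-8859-1 when none is declared (an assumption here).
    static Charset charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return Charset.forName(p.substring("charset=".length()).trim());
            }
        }
        return Charset.forName("ISO-8859-1");
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String page = "<p>\u4eba</p>";

        String header = contentType(gbk);
        byte[] body = page.getBytes(gbk); // the bytes that travel over the network

        // The receiver decodes the body with the rule named in the header.
        String shown = new String(body, charsetOf(header));
        System.out.println(header + " -> decoded correctly: " + shown.equals(page));
    }
}
```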
5: Data the browser sends to the server (this depends on the browser and its configuration): this data also travels over the network, so you need to know which mapping rule the client used to turn the characters entered by the user into the bytes sent to the server. The client typically sends data using the encoding of the page (I have not tested this carefully), but the web server (Tomcat) uses the ISO-8859-1 mapping rule by default, so the mapping rules on the two sides are inconsistent. The solution is request.setCharacterEncoding, which sets the mapping rule the server applies to the byte data sent by the browser. (Servers may implement this method differently; in our project WebLogic and Tomcat did not behave the same. On WebLogic the method applied the given encoding to both the HTTP message header and the content block, but on Tomcat it applied only to the content block, while the mapping rule for the message header, which carries the URL and its query string, is the URIEncoding configured on the Connector in server.xml.)
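The mismatch can be simulated without a servlet container. The sketch below is plain Java rather than Tomcat or WebLogic code, and all names in it are illustrative. The `redecode` step is a commonly seen workaround for a value that has already been decoded as ISO-8859-1; it is an addition here, not something the text above prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RequestDecodingDemo {
    // What the server sees when it decodes request bytes with the given rule.
    static String decode(byte[] requestBytes, Charset serverCharset) {
        return new String(requestBytes, serverCharset);
    }

    // Workaround for an already mis-decoded value: ISO-8859-1 maps every byte
    // to exactly one character, so the original bytes can be recovered and
    // decoded again with the rule the browser actually used.
    static String redecode(String misdecoded, Charset actualCharset) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actualCharset);
    }

    public static void main(String[] args) {
        String typed = "\u4eba"; // what the user entered, U+4EBA
        byte[] sent = typed.getBytes(StandardCharsets.UTF_8); // browser used the page encoding

        String byDefault = decode(sent, StandardCharsets.ISO_8859_1); // server default
        String declared = decode(sent, StandardCharsets.UTF_8);       // rule set explicitly

        System.out.println("default rule matches:  " + byDefault.equals(typed));
        System.out.println("declared rule matches: " + declared.equals(typed));
        System.out.println("re-decoded matches:    "
                + redecode(byDefault, StandardCharsets.UTF_8).equals(typed));
    }
}
```

Declaring the right rule before the data is decoded is the cleaner fix; re-decoding only works because ISO-8859-1 happens to preserve every byte.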
Browsers differ too. In IE6, I had the server send UTF-8, and a packet capture confirmed that the data sent down by the server was indeed UTF-8. But when the character was passed through a URL, as in <a href="1.jsp?a=趁"/>, IE transmitted e8,b6 (one byte short), whereas the UTF-8 encoding of "趁" is e8,b6,81. When it was submitted through a form, whether by GET or POST, what was transmitted was "%e8,%b6,%81".
This form, in which each byte is prefixed with "%", writes the binary bytes in hexadecimal. In Opera, all three of the methods above transmit "%e8,%b6,%81". So what happens if the server sends GBK? Hehe, test it yourself.
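The "%" form can be reproduced with the JDK's `java.net.URLEncoder`, which percent-encodes the bytes a character has under a named encoding. A small sketch, assuming GBK is available; `PercentEncodingDemo` and `encode` are illustrative names, and the character U+8D81 is the one whose UTF-8 bytes are e8,b6,81.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PercentEncodingDemo {
    // Percent-encode a query value using the named character encoding.
    static String encode(String value, String encodingName) {
        try {
            return URLEncoder.encode(value, encodingName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown encoding: " + encodingName, e);
        }
    }

    public static void main(String[] args) {
        String value = "\u8d81"; // U+8D81
        // Same character, different bytes, therefore a different query string.
        System.out.println("1.jsp?a=" + encode(value, "UTF-8"));
        System.out.println("1.jsp?a=" + encode(value, "GBK"));
    }
}
```

Running it with "GBK" is one way to try the GBK case left as an exercise above, at least for what a standards-following client would send.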
6: JavaScript encoding: this also depends on the browser. In IE, calling open("GET", "2.jsp?a=人") on an XMLHttpRequest passes C8CB, which is the GBK encoding of "人", even though the current page is UTF-8 encoded. Opera passes %e4%ba%ba, which is correct. If the server sends its bytes as UTF-16, IE still passes the GBK encoding of "人" and Opera still passes the UTF-8 encoding.
I don't know whether, or where, this can be configured in the browser.
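What this means for the server can be checked by decoding the two reported byte sequences under each rule. A sketch with illustrative names (`BrowserBytesDemo`, `decode`), assuming GBK is available:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BrowserBytesDemo {
    // Decode raw request bytes under the given mapping rule.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] fromIe = {(byte) 0xC8, (byte) 0xCB};                 // bytes reported for IE
        byte[] fromOpera = {(byte) 0xE4, (byte) 0xBA, (byte) 0xBA}; // bytes reported for Opera
        String expected = "\u4eba"; // U+4EBA

        System.out.println("IE bytes as GBK:      " + decode(fromIe, gbk).equals(expected));
        System.out.println("IE bytes as UTF-8:    "
                + decode(fromIe, StandardCharsets.UTF_8).equals(expected));
        System.out.println("Opera bytes as UTF-8: "
                + decode(fromOpera, StandardCharsets.UTF_8).equals(expected));
    }
}
```

A server that expects UTF-8 cannot recover the character from the bytes IE sends, which is why the two browsers need different handling.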
7: Database encoding: the database is hard to test, because the encoding of the database client is involved as well. If the correct characters are stored in the database files according to the encoding specified for the database, the database server sends the correct characters to the client when the data is queried; but if the client's character-to-byte mapping rule is set incorrectly, the user may wrongly conclude that the wrong characters are stored in the database. The best approach would be to capture and analyze the packets, but database protocols are very complex and I was unable to capture them.
Still, if something goes wrong along the way, then as described above the problem must lie somewhere between browser <-> web server, web server <-> database, database <-> database client, SSH client <-> SSH server, and so on.
To resolve garbled text caused by the two sides using inconsistent character-to-byte mapping rules:
1: Tell the other side which mapping rule was used to convert the characters to bytes.
2: Agree in advance, so that both sides map with the same rule.
3: As in XML, state in the first line of the data which mapping rule the data that follows uses.
4: Store special marker bytes at the start of the file to identify the mapping rule, as Windows does for UTF-16 (problems arise if the two characters "联通" (Unicom) appear at the beginning of a document; you can search the Internet for this case and analyze it yourself, hehe).
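Approach 4 can be sketched as a check for a leading byte-order mark. The helper below recognizes only three common marks and is an illustration, not a complete detector; `BomSniffDemo` and `charsetFromBom` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffDemo {
    // Return the charset announced by a leading byte-order mark, or null if
    // the data does not start with one of the three marks checked here.
    static Charset charsetFromBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Marker bytes followed by U+4EBA in little-endian UTF-16.
        byte[] markedFile = {(byte) 0xFF, (byte) 0xFE, (byte) 0xBA, 0x4E};
        // The same character in GBK, with no marker at all.
        byte[] plainFile = {(byte) 0xC8, (byte) 0xCB};

        System.out.println(charsetFromBom(markedFile));
        System.out.println(charsetFromBom(plainFile));
    }
}
```

When no marker is present the reader has to fall back on a default or a guess, and a guess can be wrong.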