Garbled text (mojibake) is a common problem in web development. It usually shows up in one of three places:
1. Chinese text written in a JSP file becomes garbled
2. Chinese characters displayed on the page become garbled
3. Parameters obtained on the server through request.getParameter() are garbled
Basic knowledge of encoding
Computers can only store and transmit information as bytes, while people need to read strings. The correspondence between bytes and strings is a character set. For example, the character "中" maps to the three bytes E4 B8 AD under the UTF-8 character set, and conversely those three bytes map back to the character "中" under UTF-8. Different character sets have different mapping rules and cover different ranges of characters. For example, "中" in GB2312 is represented by the two bytes D6 D0. The conversion between characters and bytes is described as encoding and decoding:
- Character -> byte: encoding. For example, "中" encoded with UTF-8 is E4 B8 AD
- Byte -> character: decoding. For example, the byte array D6 D0 decoded with GB2312 is the character "中"
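The byte values quoted above can be checked with a few lines of Java. This is only a minimal sketch: the class name EncodeDecodeDemo and its helper methods are made up for illustration, and it assumes the JDK in use ships the GB2312 charset (standard JDK builds normally do). The example character is written as the escape \u4e2d so that the result does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names are illustrative only.
class EncodeDecodeDemo {
    // Format a byte array as upper-case hex pairs separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encoding: characters -> bytes under the named character set.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Decoding: bytes -> characters under the named character set.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D, the example character used in the text
        System.out.println(toHex(encode(zhong, "UTF-8")));   // E4 B8 AD
        System.out.println(toHex(encode(zhong, "GB2312")));  // D6 D0
        byte[] gbBytes = {(byte) 0xD6, (byte) 0xD0};
        System.out.println(decode(gbBytes, "GB2312").equals(zhong)); // true
    }
}
```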
There is another kind of encoding, called URI encoding and URI decoding. It is not a conversion between a string and a byte stream; rather, it uses one string to represent another. For example:
- "中" URI-encoded with UTF-8 is %E4%B8%AD
- The string %E4%B8%AD URI-decoded with UTF-8 is the character "中"
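To make the relationship between bytes and the percent form concrete, here is a rough sketch that percent-encodes by hand. PercentEncodeSketch and its methods are hypothetical names. Unlike a real URI encoder, which leaves unreserved ASCII characters untouched, this version escapes every byte so that the idea stays visible, and the decoder assumes its input consists only of %XX triplets.

```java
import java.nio.charset.Charset;

// Hypothetical sketch: percent-encoding written out by hand.
class PercentEncodeSketch {
    // Write every byte of the text (under the given charset) as %XX.
    static String percentEncodeAll(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            sb.append('%').append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Reverse step: read each %XX triplet as one byte, then decode the bytes.
    // Assumes the input contains nothing but %XX triplets.
    static String percentDecodeAll(String encoded, String charsetName) {
        byte[] bytes = new byte[encoded.length() / 3];
        for (int i = 0; i < bytes.length; i++) {
            String hex = encoded.substring(i * 3 + 1, i * 3 + 3);
            bytes[i] = (byte) Integer.parseInt(hex, 16);
        }
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D, the example character used in the text
        System.out.println(percentEncodeAll(zhong, "UTF-8")); // %E4%B8%AD
        System.out.println(percentEncodeAll(zhong, "GBK"));   // %D6%D0
        System.out.println(percentDecodeAll("%E4%B8%AD", "UTF-8").equals(zhong)); // true
    }
}
```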
As can be seen, URI encoding writes each byte of the character's encoding in the chosen character set as "%" followed by that byte's hexadecimal value. In Java, the String class has two commonly used methods for encoding and decoding:
- getBytes: for example, "中".getBytes("character set") encodes the string with the specified character set
- new String(bytes[], "character set"): decodes a byte array with the specified character set
Why garbled text occurs
Browsers and servers are connected through the network. The browser's request is encoded into a byte stream for transmission, and the application server decodes the received byte stream into strings using some character set. If the browser and the server use different or incompatible character sets, the text becomes garbled. For example, the browser encodes "中" with UTF-8 as the bytes E4 B8 AD and sends them over the network. If the application server then decodes those bytes as GBK, the first two bytes are decoded as "涓" and the original character is lost. This is only a simplified picture; the actual process is more complex.
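The mismatch described here can be simulated without any network by encoding with one charset and decoding with another. The following is a sketch under that simplification; MojibakeDemo and transmit are hypothetical names, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: a sender and a receiver that disagree on the charset.
class MojibakeDemo {
    // Encode with the sender's charset, decode with the receiver's charset.
    static String transmit(String text, Charset senderCharset, Charset receiverCharset) {
        byte[] wire = text.getBytes(senderCharset); // what travels over the network
        return new String(wire, receiverCharset);   // what the receiver reconstructs
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D, the example character used in the text
        Charset gbk = Charset.forName("GBK");

        String matched = transmit(zhong, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String garbled = transmit(zhong, StandardCharsets.UTF_8, gbk);

        System.out.println(matched.equals(zhong)); // true
        System.out.println(garbled.equals(zhong)); // false
        // Code point of the first character the GBK decoder produced from the UTF-8 bytes.
        System.out.println(Integer.toHexString(garbled.charAt(0)));
    }
}
```

When both sides agree on the charset the round trip is lossless; the damage only appears when the decoding side picks a different one.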
Therefore, to understand encoding problems on the web, it is necessary to be clear about the request process and about where encoding and decoding happen in it. The web follows a request-response pattern. The user operates the browser, for example by clicking a button to submit a form or by clicking a hyperlink, and the browser sends a request to the application server. The servlet container receives the request and, according to the web.xml settings, calls the appropriate application. The application performs some logic based on the request and returns a piece of HTML to the browser, which parses the HTML and displays it to the user. This is one request-response cycle. The following sections describe it in more detail:
1. The browser sends a request to the application server
A browser generally sends requests to the application server in three ways: 1) form submission, 2) hyperlinks, 3) Ajax.
1) Form submission
A form can be submitted with either POST or GET.
With POST, the browser encodes the strings in the form into a byte stream using the character set of the page and sends it to the server.
With GET, the browser first URI-encodes the form values using the page's character set, appends them to the action URL, and then sends the request to the application server. For example:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>test</title>
</head>
<body>
<form action="http://www.google.com">
<input type="text" name="test" value="中"/>
<input type="submit"/>
</form>
</body>
</html>
After clicking Submit, the URL is www.google.com.hk/?test=%E4%B8%AD. The character "中" is encoded as %E4%B8%AD because the character set of the current page is UTF-8. In other words, a GET form submission URI-encodes the values with the page's character set.
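The effect of the page's character set on a GET submission can be imitated with java.net.URLEncoder, which produces the same form-style encoding. This is a sketch of the behaviour described above, not of any browser's actual code; GetFormQuerySketch and queryFor are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch of how a GET form's query string depends on the page charset.
class GetFormQuerySketch {
    // Build "name=value", URI-encoding both parts with the page's character set.
    static String queryFor(String name, String value, String pageCharset) {
        try {
            return URLEncoder.encode(name, pageCharset) + "=" + URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + pageCharset, e);
        }
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D, the value typed into the form
        System.out.println(queryFor("test", zhong, "UTF-8")); // test=%E4%B8%AD
        System.out.println(queryFor("test", zhong, "GBK"));   // test=%D6%D0
    }
}
```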
2) Hyperlink
Hyperlinks often carry parameters, and sometimes those parameters contain Chinese, as in this code: <a href="http://www.google.com/?test=中">link</a>. What the browser finally sends to the server is always ASCII characters. Because "中" is not in the ASCII character set, it is URI-encoded as well, but different browsers use different character sets for this. Take the hyperlink above. In Windows 7, IE8 sends www.google.com.hk/?test=%D6%D0 whether the page declares content="text/html; charset=utf-8" or content="text/html; charset=GBK". This is the GBK URI encoding, so here IE8's encoding of the URL in a hyperlink does not depend on the page encoding but on the system's default encoding. In XP, IE8 URI-encodes with the character set of the page encoding. IE6 sends the GBK URI encoding when the page encoding is GBK, but when the page encoding is UTF-8 it sends only the first two bytes of the UTF-8 URI encoding. Other browsers, such as Firefox and Chrome, URI-encode with the page encoding.
| Operating system | Browser | Page encoding | Request string sent | Description |
|---|---|---|---|---|
| Windows 7 | IE8 (Chinese) | UTF-8 | test=%D6%D0 | In Windows 7, the URI encoding uses the GBK character set regardless of the page encoding |
| Windows 7 | IE8 (Chinese) | GBK | test=%D6%D0 | In Windows 7, the URI encoding uses the GBK character set regardless of the page encoding |
| XP | IE8 (Chinese) | UTF-8 | test=%E4%B8%AD | In XP, the URI encoding uses the character set of the page encoding |
| XP | IE8 (Chinese) | GBK | test=%D6%D0 | In XP, the URI encoding uses the character set of the page encoding |
| Windows 2003 | IE6 (Chinese) | GBK | test=%D6%D0 | GBK is correct, UTF-8 is not |
| Windows 2003 | IE6 (Chinese) | UTF-8 | test=%E4%B8 | GBK is correct, UTF-8 is not |
| -- | Chrome (Chinese), Firefox (English) | UTF-8 | test=%E4%B8%AD | The URI encoding uses the character set of the page encoding |
| -- | Chrome (Chinese), Firefox (English) | GBK | test=%D6%D0 | The URI encoding uses the character set of the page encoding |
As can be seen, when Chinese is written directly in a URL, different versions of IE on different operating systems may produce different URI encodings, while Chrome and Firefox encode it the same way as a form submission. Writing non-ASCII characters directly in a link is therefore risky, because how they are encoded depends on the client's environment. To keep the browser from making an unpredictable choice, URI-encode the Chinese in the program before putting it into the URL. JavaScript provides the encodeURI() function, which performs UTF-8 URI encoding. On the Java side, java.net.URLEncoder.encode(str, "character set") can be used.
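One way to follow this advice on the server side is to build links through a small helper that always encodes with one fixed character set. The sketch below is only an illustration: SafeLinkSketch, buildLink, decodeValue and the choice of UTF-8 as the fixed charset are assumptions, not something the article prescribes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper that URI-encodes link parameters in the program,
// so the browser only ever sees ASCII in the URL.
class SafeLinkSketch {
    static final String LINK_CHARSET = "UTF-8"; // assumed project-wide choice

    // Append one encoded parameter to a base URL that has no query string yet.
    static String buildLink(String baseUrl, String name, String value) {
        try {
            return baseUrl + "?" + URLEncoder.encode(name, LINK_CHARSET)
                    + "=" + URLEncoder.encode(value, LINK_CHARSET);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Decode a single encoded value with the same fixed charset.
    static String decodeValue(String encodedValue) {
        try {
            return URLDecoder.decode(encodedValue, LINK_CHARSET);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String link = buildLink("http://www.google.com/", "test", "\u4e2d");
        System.out.println(link); // http://www.google.com/?test=%E4%B8%AD
        System.out.println(decodeValue("%E4%B8%AD").equals("\u4e2d")); // true
    }
}
```

Because the resulting href contains only ASCII characters, the browser has nothing left to encode, and the client environment no longer affects what reaches the server.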
3) Ajax
Ajax can use either the GET or the POST method, and the situation is similar to the cases described above.
2. The application server gets the parameters
In a servlet, the parameters sent by the browser are generally obtained through request.getParameter(). Note that what the servlet container receives underneath is an InputStream, that is, a byte stream, while request.getParameter() returns a string. A decoding step therefore takes place inside getParameter(), and the character set used for it may vary with the application server and the operating system. The ServletRequest interface provides the method setCharacterEncoding() to set the character set that getParameter decodes with, and it must be called before getParameter. The Tomcat source code shows why: on its first call, getParameter initializes a map that stores the parameter names and values, decoded with the character set in effect at that moment. Once the parameters have been decoded, later calls read the values directly from the map without decoding again, so setCharacterEncoding has to be invoked before the first getParameter. It is often said that this method only works for parameters passed by POST and not for those passed by GET. That is true for Tomcat 5, but in WebSphere and Apusic setCharacterEncoding takes effect for both POST and GET.
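The decode-once-and-cache behaviour described above can be imitated in a few lines. This is not Tomcat's code: SketchRequest is a hypothetical stand-in for a servlet request, and its ISO-8859-1 default is an assumption, modelled on the Tomcat 5.5 default mentioned later in the text.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a servlet request; not a real container class.
class SketchRequest {
    private final String rawQuery;                    // still undecoded, e.g. "test=%E4%B8%AD"
    private String characterEncoding = "ISO-8859-1";  // assumed container default
    private Map<String, String> parameters;           // filled on the first getParameter call

    SketchRequest(String rawQuery) {
        this.rawQuery = rawQuery;
    }

    void setCharacterEncoding(String charsetName) {
        this.characterEncoding = charsetName;
    }

    // The first call decodes every parameter and caches the result;
    // later calls only read from the cache.
    String getParameter(String name) {
        if (parameters == null) {
            parameters = parse(rawQuery, characterEncoding);
        }
        return parameters.get(name);
    }

    private static Map<String, String> parse(String query, String charsetName) {
        Map<String, String> map = new HashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // ignore malformed pairs in this sketch
            }
            try {
                map.put(URLDecoder.decode(pair.substring(0, eq), charsetName),
                        URLDecoder.decode(pair.substring(eq + 1), charsetName));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        SketchRequest early = new SketchRequest("test=%E4%B8%AD");
        early.setCharacterEncoding("UTF-8");            // set before the first read
        System.out.println(early.getParameter("test").equals("\u4e2d")); // true

        SketchRequest late = new SketchRequest("test=%E4%B8%AD");
        late.getParameter("test");                      // triggers decoding with the default
        late.setCharacterEncoding("UTF-8");             // too late, the map is already built
        System.out.println(late.getParameter("test").equals("\u4e2d"));  // false
    }
}
```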
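| Application server | Default encoding of the server's system | Page encoding | Submit method | URI encoding | setCharacterEncoding | getParameter result | Note |
|---|---|---|---|---|---|---|---|
| WebSphere 6.1 | GBK | UTF-8 | POST | - | UTF-8 | Correct | Server default configuration |
| WebSphere 6.1 | GBK | UTF-8 | POST | - | GBK | Wrong | Server default configuration |
| WebSphere 6.1 | GBK | UTF-8 | GET | UTF-8 | UTF-8 | Correct | Server default configuration |
| WebSphere 6.1 | GBK | UTF-8 | Hyperlink | GBK | GBK | Correct | Server default configuration |
| Tomcat 5.5 | GBK | UTF-8 | POST | - | UTF-8 | Correct | URIEncoding and useBodyEncodingForURI not set |
| Tomcat 5.5 | GBK | UTF-8 | POST | - | GBK | Wrong | URIEncoding and useBodyEncodingForURI not set |
| Tomcat 5.5 | GBK | UTF-8 | GET | UTF-8 | UTF-8 | Wrong | URIEncoding and useBodyEncodingForURI not set |
| Tomcat 5.5 | GBK | UTF-8 | Hyperlink | GBK | GBK | Wrong | URIEncoding and useBodyEncodingForURI not set |
| Apusic 5.1 | GBK | UTF-8 | POST | - | UTF-8 | Correct | Server default configuration |
| Apusic 5.1 | GBK | UTF-8 | POST | - | GBK | Wrong | Server default configuration |
| Apusic 5.1 | GBK | UTF-8 | GET | UTF-8 | UTF-8 | Correct | Server default configuration |
| Apusic 5.1 | GBK | UTF-8 | Hyperlink | GBK | GBK | Correct | Server default configuration |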
As the table above shows, in WebSphere 6.1 and Apusic 5.1, getParameter decodes with the character set given to setCharacterEncoding for both GET and POST. In Tomcat 5 this holds for POST but not for GET. Looking back at these experiments: with POST, the browser encodes the form into a byte stream using the page's character set and sends it to the server, and the server decodes the received byte stream into strings using the character set given to setCharacterEncoding.
That is, for a POST submission, getParameter returns the correct value as long as "character set of the page encoding = character set passed to setCharacterEncoding". GET and hyperlinks work in a similar way.
When a form is submitted with GET, the values are URI-encoded according to the page encoding, whereas a hyperlink can be URI-encoded with a character set chosen by the program.
What the two have in common is that the browser sends URI-encoded values. In WebSphere, for both GET and hyperlinks, getParameter returns the correct result as long as "character set of the URI encoding = character set passed to setCharacterEncoding". For a GET form submission, "character set of the URI encoding = character set of the page encoding" holds. For hyperlinks, as described above, Chrome and Firefox URI-encode with the character set of the page encoding, but IE does not necessarily do so.
In Tomcat 5.5, getParameter by default decodes parameters passed by GET or by hyperlink with ISO-8859-1. For example, if the browser sends a UTF-8 URI-encoded request, getParameter in Tomcat 5.5 decodes it with ISO-8859-1 and the result is wrong. To get the correct value, Tomcat 5.5 has to decode with UTF-8, which is done by setting URIEncoding="UTF-8" or useBodyEncodingForURI="true" on the Connector in server.xml (with useBodyEncodingForURI="true", the URI is decoded with the same character set as the request body, which should match the page encoding). If neither is configured and the correct value is still needed, the program has to convert the value itself. Because getParameter decoded with ISO-8859-1, first re-encode its result into the original byte array with getParameter().getBytes("ISO-8859-1"), then decode those bytes into a string with the UTF-8 character set: new String(getParameter().getBytes("ISO-8859-1"), "UTF-8")
3. Setting the page encoding for the browser
What the server sends to the browser is likewise encoded into a byte stream for transmission over the network. The browser decodes the received byte stream into strings with the specified character set and displays them. If the character sets used in these two steps do not match, the text is garbled.
For example, if a static HTML file or a JSP is saved in UTF-8, the browser must be told to decode with UTF-8.
In a JSP, this can be set with <%@ page contentType="text/html; charset=utf-8" language="java" %>. In a static file, it can be set with <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>. When writing output directly in a servlet, it can be set with response.setCharacterEncoding("UTF-8"), response.setContentType("text/html; charset=utf-8"), or response.setHeader("Content-Type", "text/html;charset=utf-8").
The JSP directive and the servlet calls amount to adding "Content-Type: text/html;charset=utf-8" to the response headers.
The encoding information in the response header takes precedence over the HTML meta tag. That is, if the servlet calls setContentType("text/html;charset=utf-8") while the JSP contains <meta http-equiv="Content-Type" content="text/html; charset=GBK"/>, the browser decodes with the UTF-8 character set.
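The precedence rule can be written down as a small function. This is a simplified sketch of the rule as stated above, not of how any particular browser detects encodings; ResponseCharsetSketch, charsetOf and effectiveCharset are hypothetical names, and quoted charset values are not handled.

```java
import java.util.Locale;

// Hypothetical sketch of "header charset wins over meta charset".
class ResponseCharsetSketch {
    // Extract the charset parameter from a Content-Type value such as
    // "text/html; charset=utf-8". Returns null when no charset is present.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return p.substring("charset=".length()).trim().toUpperCase(Locale.ROOT);
            }
        }
        return null;
    }

    // The response header is consulted first; the meta tag is only a fallback.
    static String effectiveCharset(String headerContentType, String metaContentType) {
        String fromHeader = charsetOf(headerContentType);
        return fromHeader != null ? fromHeader : charsetOf(metaContentType);
    }

    public static void main(String[] args) {
        // Header says UTF-8, meta says GBK: the header wins.
        System.out.println(effectiveCharset("text/html;charset=utf-8", "text/html; charset=GBK")); // UTF-8
        // Header carries no charset: fall back to the meta tag.
        System.out.println(effectiveCharset("text/html", "text/html; charset=GBK")); // GBK
    }
}
```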