A description of the HTTP request encoding problem (repost)

Source: Internet
Author: User

Garbled text is a common problem in web development. It usually shows up in one of these forms:

1. Chinese text written in a JSP file becomes garbled

2. Chinese characters on the page become garbled

3. Values obtained in the back end through request.getParameter() are garbled

Basic knowledge of encoding

Computers can only store and transmit information as bytes, while people read strings. The correspondence between bytes and strings is defined by a character set. For example, the character "中" maps to the three bytes E4 B8 AD in UTF-8, and conversely those three bytes map back to "中" under UTF-8. Different character sets have different mapping rules and cover different ranges of characters: in GB2312, "中" is represented by the two bytes D6 D0. The conversion between characters and bytes is described as encoding and decoding:

• Character -> bytes: encoding. For example, "中" encoded in UTF-8 is E4 B8 AD

• Bytes -> character: decoding. For example, the byte array D6 D0 decoded according to GB2312 is the character "中"
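The two conversions above can be sketched in Java. This is a minimal illustration, not part of the original article: the class and method names are made up for the example, and it assumes the Java runtime ships the GB2312 charset (standard JDK builds do).

```java
import java.nio.charset.Charset;

// Hypothetical helper class, named for this example only.
final class CharsetBytesDemo {
    // Encoding: characters -> bytes, using the named character set.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Decoding: bytes -> characters, using the named character set.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Renders bytes as space-separated upper-case hex, e.g. "E4 B8 AD".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the character 中, written as an escape to avoid source-file encoding issues
        System.out.println(hex(encode(zhong, "UTF-8")));  // E4 B8 AD
        System.out.println(hex(encode(zhong, "GB2312"))); // D6 D0
        byte[] gb = {(byte) 0xD6, (byte) 0xD0};
        System.out.println(decode(gb, "GB2312").equals(zhong)); // true
    }
}
```

The same character yields three bytes under one character set and two under another, which is why the receiver has to know which one the sender used.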

There is also a class of encodings called URI encoding and URI decoding. URI encoding is not a conversion between a string and a byte stream; it represents one string by another string. For example:

• "中" URI-encoded with UTF-8 is %E4%B8%AD

• The string %E4%B8%AD URI-decoded with UTF-8 is the character "中"
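A small Java sketch of URI encoding and decoding follows. It uses java.net.URLEncoder and java.net.URLDecoder from the standard library; the wrapper class and method names are assumptions made for this example, and the GBK charset is assumed to be available in the runtime.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical wrapper, named for this example only.
final class UriEncodingDemo {
    // String -> percent-encoded string, using the bytes of the named character set.
    static String uriEncode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charsetName, e);
        }
    }

    // Percent-encoded string -> original string, interpreting the bytes with the named character set.
    static String uriDecode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the character 中
        System.out.println(uriEncode(zhong, "UTF-8")); // %E4%B8%AD
        System.out.println(uriEncode(zhong, "GBK"));   // %D6%D0
        System.out.println(uriDecode("%E4%B8%AD", "UTF-8").equals(zhong)); // true
    }
}
```

Note that URLEncoder produces the form-submission flavour of URI encoding, in which a space becomes "+"; for the single Chinese character used here that difference does not matter.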

As can be seen, a URI-encoded string consists of "%" followed by the hexadecimal value of each byte of the character in the chosen character set. In Java, the String class has two common methods for encoding and decoding:

• getBytes: for example, "中".getBytes("character set") encodes the string according to the specified character set

• new String(byte[], "character set"): decodes a byte array according to the specified character set

Why garbled text occurs

Browsers and servers are connected through a network. The browser encodes a request into a byte stream for transmission, and the application server decodes the received byte stream back into strings using some character set. If the browser and the server use different or incompatible character sets, the result is garbled. For example, the browser encodes "中" in UTF-8 as the bytes E4 B8 AD and sends them over the network; if the application server decodes the received bytes as GBK, the first two bytes are decoded as "涓". This is a simplified picture; the actual process is more complex.
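The mismatch described above can be reproduced in a few lines of Java. This is only a sketch of the idea: there is no real network here, the "browser" and "server" are just two charset arguments, and the class and method names are invented for the example. It assumes the runtime provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical model of "encode on one side, decode on the other".
final class MojibakeDemo {
    // Encodes text with the sender's charset and decodes the bytes with the receiver's charset.
    static String sendAndReceive(String text, Charset senderCharset, Charset receiverCharset) {
        byte[] wire = text.getBytes(senderCharset); // what travels over the network
        return new String(wire, receiverCharset);   // what the receiver believes was sent
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the character 中
        Charset gbk = Charset.forName("GBK");

        String matched = sendAndReceive(zhong, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mismatched = sendAndReceive(zhong, StandardCharsets.UTF_8, gbk);

        System.out.println(matched.equals(zhong));    // true
        System.out.println(mismatched.equals(zhong)); // false: the UTF-8 bytes were read as GBK
    }
}
```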

Therefore, to understand encoding problems on the web, it is necessary to be clear about the request process and where encoding and decoding happen in it. The web follows a request-response pattern. A user operates the browser, for example by clicking a button to submit a form or clicking a hyperlink, and the browser sends a request to the application server. The servlet container receives the request and calls the appropriate application according to the web.xml settings. The application performs its logic based on the request and returns a piece of HTML to the browser, which parses it and displays it to the user. That is one request-response cycle; the following sections describe it in more detail.

1. The browser sends a request to the application server

Browsers generally send requests to the application server in three ways: form submission, hyperlinks, and Ajax.

1) Form submission

Form submission is divided into two methods, POST and GET.

When POST is used, the browser encodes the strings in the form into a byte stream using the page's character set and sends it to the server.

When GET is used, the browser URI-encodes the form values using the page's character set, appends them to the action URL, and sends that URL to the application server. For example:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>test</title>
<body>
<form action="http://www.google.com">
<input type="text" name="test" value="中"/>
<input type="submit"/>
</form>
</body>

After clicking Submit, the URL is www.google.com.hk/?test=%E4%B8%AD. The character "中" has been encoded as %E4%B8%AD because the character set of the current page is UTF-8; when a form is submitted with GET, URI encoding uses the page's character set.

2) Hyperlink

Hyperlinks often carry parameters, and sometimes these contain Chinese, as in this code: <a href="http://www.google.com/?test=中">link</a>. What the browser finally sends to the server are all ASCII characters. Since "中" is not in the ASCII character set, it is URI-encoded as well, but different browsers use different character sets for this. Take the hyperlink fragment above. On Windows 7, regardless of whether the page declares content="text/html; charset=UTF-8" or content="text/html; charset=GBK", IE8 sends www.google.com.hk/?test=%D6%D0, which is the GBK URI encoding: there, IE8's URI encoding of hyperlinks is independent of the page encoding and follows the system's default encoding. On XP, IE8 URI-encodes with the character set of the page. IE6 sends the GBK URI encoding when the page encoding is GBK, but when the page encoding is UTF-8 it sends only the first two bytes of the UTF-8 URI encoding. Other browsers, such as Firefox and Chrome, URI-encode with the page's character set.

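| Operating system | Browser | Page encoding | Request string sent | Description |
|---|---|---|---|---|
| Windows 7 | IE8 (Chinese) | UTF-8 | test=%D6%D0 | On Windows 7, URI encoding uses the GBK character set regardless of the page encoding |
| Windows 7 | IE8 (Chinese) | GBK | test=%D6%D0 | On Windows 7, URI encoding uses the GBK character set regardless of the page encoding |
| XP | IE8 (Chinese) | UTF-8 | test=%E4%B8%AD | On XP, URI encoding uses the page's character set |
| XP | IE8 (Chinese) | GBK | test=%D6%D0 | On XP, URI encoding uses the page's character set |
| Windows 2003 | IE6 (Chinese) | GBK | test=%D6%D0 | Correct for GBK, wrong for UTF-8 |
| Windows 2003 | IE6 (Chinese) | UTF-8 | test=%E4%B8 | Correct for GBK, wrong for UTF-8 |
| -- | Chrome (Chinese), Firefox (English) | UTF-8 | test=%E4%B8%AD | URI encoding uses the page's character set |
| -- | Chrome (Chinese), Firefox (English) | GBK | test=%D6%D0 | URI encoding uses the page's character set |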

As can be seen, when Chinese is written directly in a URL, different versions of IE on different operating systems may produce different URI encodings, while Chrome and Firefox use the same encoding as for forms. Writing non-ASCII characters directly in a link is therefore risky, because how they are encoded depends on the client's environment. To avoid leaving the URI encoding to the browser, URI-encode the Chinese in the program before placing it in the URL. JavaScript provides the encodeURI() function, which produces UTF-8 URI encoding; in Java, java.net.URLEncoder.encode(str, "character set") can be used.
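One way to follow this advice on the server side is to build the link with an explicitly chosen character set, so the browser only ever sees ASCII. The sketch below is an illustration under that assumption; the class and the withParam helper are hypothetical names, not an API from the article.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for building links whose parameters are already URI-encoded.
final class SafeLinkDemo {
    // Appends name=value to baseUrl, URI-encoding both parts with the given character set.
    static String withParam(String baseUrl, String name, String value, String charsetName) {
        try {
            String separator = baseUrl.contains("?") ? "&" : "?";
            return baseUrl + separator
                    + URLEncoder.encode(name, charsetName)
                    + "="
                    + URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The character 中 is encoded by the program, not by the browser.
        System.out.println(withParam("http://www.google.com/", "test", "\u4e2d", "UTF-8"));
        // http://www.google.com/?test=%E4%B8%AD
    }
}
```

The character set passed here has to be the one the server will later decode with; the encoding step only removes the browser from the decision.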

3) Ajax

Ajax can use either GET or POST, and the situation is similar to the two cases above.

2. The application server gets the parameters

In a servlet, the parameters sent by the browser are generally obtained through request.getParameter(). Note that what the servlet container receives underneath is an InputStream, that is, a byte stream, while request.getParameter() returns a string, so there is a decoding step inside getParameter(). The character set used for that decoding may vary with the application server and the operating system. The ServletRequest interface provides the method setCharacterEncoding() to set the character set that getParameter() decodes with, and it must be called before getParameter(). The Tomcat source code shows why: on its first call, getParameter() initializes a Map that stores the parameter names and values, decoded with the character set that is set at that moment. Once they have been decoded, later calls read the values directly from the Map without decoding again, so setCharacterEncoding() has to be invoked before the first getParameter(). It is often said that this method is effective only for parameters passed by POST and not for parameters passed by GET. This is true for Tomcat 5, but in WebSphere and Apusic setCharacterEncoding() is effective for both POST and GET.
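The "decode once, then cache" behaviour can be modelled without a servlet container. The class below is a hypothetical stand-in written for this illustration; it is not the real ServletRequest and not Tomcat's code, and its ISO-8859-1 default is an assumption of the model. It only shows why a late setCharacterEncoding() call has no effect.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a request object; NOT the servlet API.
final class LazyParamRequest {
    private final String rawQuery;                   // still URI-encoded, e.g. "test=%E4%B8%AD"
    private String characterEncoding = "ISO-8859-1"; // assumed default for this model
    private Map<String, String> decoded;             // built once, on the first getParameter call

    LazyParamRequest(String rawQuery) {
        this.rawQuery = rawQuery;
    }

    void setCharacterEncoding(String charsetName) {
        this.characterEncoding = charsetName;
    }

    String getParameter(String name) {
        if (decoded == null) {
            // First call: decode everything with whatever charset is set right now.
            decoded = parse(rawQuery, characterEncoding);
        }
        return decoded.get(name); // later calls never decode again
    }

    private static Map<String, String> parse(String query, String charsetName) {
        Map<String, String> result = new HashMap<>();
        try {
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    continue; // ignore malformed pairs in this sketch
                }
                result.put(URLDecoder.decode(pair.substring(0, eq), charsetName),
                           URLDecoder.decode(pair.substring(eq + 1), charsetName));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charsetName, e);
        }
        return result;
    }

    public static void main(String[] args) {
        LazyParamRequest early = new LazyParamRequest("test=%E4%B8%AD");
        early.setCharacterEncoding("UTF-8"); // set before the first read
        System.out.println("\u4e2d".equals(early.getParameter("test"))); // true

        LazyParamRequest late = new LazyParamRequest("test=%E4%B8%AD");
        late.getParameter("test");           // first read decodes and caches
        late.setCharacterEncoding("UTF-8");  // too late, the cache is already filled
        System.out.println("\u4e2d".equals(late.getParameter("test")));  // false
    }
}
```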

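| Application server | Default encoding of the server's system | Page encoding | Submission method | URI encoding | setCharacterEncoding | getParameter result | Note |
|---|---|---|---|---|---|---|---|
| WebSphere 6.1 | GBK | UTF-8 | POST | - | UTF-8 | Correct | Server default configuration |
| WebSphere 6.1 | GBK | UTF-8 | POST | - | GBK | Wrong | Server default configuration |
| WebSphere 6.1 | GBK | UTF-8 | GET | UTF-8 | UTF-8 | Correct | Server default configuration |
| WebSphere 6.1 | GBK | UTF-8 | Hyperlink | GBK | GBK | Correct | Server default configuration |
| Tomcat 5.5 | GBK | UTF-8 | POST | - | UTF-8 | Correct | URIEncoding and useBodyEncodingForURI not set |
| Tomcat 5.5 | GBK | UTF-8 | POST | - | GBK | Wrong | URIEncoding and useBodyEncodingForURI not set |
| Tomcat 5.5 | GBK | UTF-8 | GET | UTF-8 | UTF-8 | Wrong | URIEncoding and useBodyEncodingForURI not set |
| Tomcat 5.5 | GBK | UTF-8 | Hyperlink | GBK | GBK | Wrong | URIEncoding and useBodyEncodingForURI not set |
| Apusic 5.1 | GBK | UTF-8 | POST | - | UTF-8 | Correct | Server default configuration |
| Apusic 5.1 | GBK | UTF-8 | POST | - | GBK | Wrong | Server default configuration |
| Apusic 5.1 | GBK | UTF-8 | GET | UTF-8 | UTF-8 | Correct | Server default configuration |
| Apusic 5.1 | GBK | UTF-8 | Hyperlink | GBK | GBK | Correct | Server default configuration |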

As the table above shows, on the WebSphere 6.1 and Apusic 5.1 application servers getParameter() decodes both GET and POST parameters with the character set given to setCharacterEncoding(), while on Tomcat 5 this holds for POST but not for GET. Looking back at these experiments: with POST, the browser encodes the form into a byte stream using the page's character set and sends it to the server, and the server decodes the received byte stream with the character set given to setCharacterEncoding() to obtain strings.

That is, for a POST submission, getParameter() returns the correct value as long as "character set of the page encoding = character set of setCharacterEncoding" holds. GET and hyperlinks work in a similar way.

When a form is submitted with GET, the values are URI-encoded according to the page encoding, whereas a hyperlink can be URI-encoded with a character set chosen by the program.

What the two have in common is that the browser performs URI encoding. In WebSphere, for both GET and hyperlinks, getParameter() returns the correct result as long as "URI-encoding character set = character set of setCharacterEncoding". For a GET form submission, "URI-encoding character set = page-encoding character set", as described above. For hyperlinks, Chrome and Firefox use the page's character set for URI encoding, but IE does not.

In Tomcat 5.5, getParameter() by default uses ISO-8859-1 to decode parameters passed by GET or by hyperlink. If the browser sends a UTF-8 encoded request, Tomcat 5.5 decodes it as ISO-8859-1 and the result is wrong. To get the correct value, Tomcat 5.5 must be made to decode with UTF-8, by setting URIEncoding="UTF-8" or useBodyEncodingForURI="true" (the latter makes Tomcat decode the URI with the same character set as the page encoding). If neither is configured and the correct value is still needed, the program has to transcode the result of getParameter(). Because getParameter() decoded with ISO-8859-1, first call getBytes("ISO-8859-1") on the result to recover the original byte array, then decode that array with the UTF-8 character set: new String(request.getParameter(name).getBytes("ISO-8859-1"), "UTF-8")

3. Set the browser's page encoding

Content sent from the server to the browser is likewise encoded into a byte stream for transmission over the network, and the browser decodes the received byte stream into strings with some character set before displaying it. If the character sets used in these two steps differ, the result is garbled.

For example, if static HTML files or JSPs are saved in UTF-8, the browser needs to be told to decode with UTF-8.

For a JSP, this can be set with <%@ page contentType="text/html; charset=UTF-8" language="java" %>. For a static file, it can be set with <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>. When output is written directly in a servlet, it can be set with response.setCharacterEncoding("UTF-8"), response.setContentType("text/html; charset=UTF-8"), or response.setHeader("Content-Type", "text/html;charset=UTF-8").

The JSP and servlet settings add "Content-Type: text/html;charset=UTF-8" to the response header, and the meta tag declares the same information inside the page.

The encoding information in the response header takes precedence over the HTML meta tag. That is, if the servlet calls setContentType("text/html;charset=UTF-8") while the JSP contains <meta http-equiv="Content-Type" content="text/html; charset=GBK"/>, the browser decodes according to the UTF-8 character set.
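The precedence rule can be written down as a small function. This is a deliberately simplified sketch of the rule stated above, not a description of how any particular browser is implemented (real browsers apply more rules than this); the class and method names are assumptions made for the example.

```java
import java.util.Locale;

// Hypothetical model of "response header wins over meta tag".
final class ResponseCharsetDemo {
    // Extracts the charset from a Content-Type value such as "text/html; charset=GBK"; null if absent.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // The header's charset is used if present; otherwise the meta tag's charset.
    static String effectiveCharset(String headerContentType, String metaContent) {
        String fromHeader = charsetOf(headerContentType);
        return fromHeader != null ? fromHeader : charsetOf(metaContent);
    }

    public static void main(String[] args) {
        // Servlet sets UTF-8 in the header, page declares GBK in its meta tag.
        System.out.println(effectiveCharset("text/html;charset=UTF-8", "text/html; charset=GBK")); // UTF-8
        // No charset in the header: the meta tag decides.
        System.out.println(effectiveCharset("text/html", "text/html; charset=GBK")); // GBK
    }
}
```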
