Encoding issues during HTTP (GET/POST) requests

Source: Internet
Author: User
Tags url decode urlencode jboss

The following is a reprinted article that explains the topic well.

First, the question:
Encoding problems are something Java beginners often run into during Web development. Many articles on the Internet cover them, but few accurately explain why the server-side program sees garbled text when a URL contains Chinese or other non-ASCII characters. This article describes in detail the garbling caused by using non-ASCII characters, such as Chinese, in a URL.

1. In a URL, Chinese characters usually appear in the following two places:
(1) Parameter values in the query string, for example: http://search.china.alibaba.com/search/offer_search.htm?keywords=中国
(2) The servlet path, for example: http://search.china.alibaba.com/selloffer/中国.html

2. The main causes of garbled text are the following:
(1) Browser: the client (browser) itself does not follow the URI encoding specification (http://www.w3.org/International/O-URL-code.html).
(2) Servlet server: the servlet server is not configured correctly.
(3) Developer: the developer does not understand the servlet specification or what the API methods mean.

Second, the basic knowledge:
1. The stages an HTTP request passes through:
Browser (IE, Firefox) [encodes] --GET/POST--> servlet server [decodes to Unicode, processes, encodes the result] --> browser display [decodes]
(1) The browser encodes the URL (and any POSTed content) and sends it to the server.
(2) "Servlet server" here really means the servlet implementation (ServletRequestWrapper) provided by the servlet server; each application server has its own implementation. These implementations convert the content to Unicode and, after processing, encode the result (that is, the Web page) and return it to the browser.
(3) The browser displays the page according to the specified encoding.

Encoding and decoding strings always involves a character set; the ones commonly used are ISO-8859-1, GBK, UTF-8, and Unicode.

2. The composition of a URL:
domain:port/contextPath/servletPath/pathInfo?queryString
Notes:

1. contextPath is specified in the servlet server's configuration file.
For WebLogic:
contextPath is configured in the application's weblogic.xml.
<context-root>/</context-root>

For Tomcat:
contextPath is configured in server.xml.
<Context path="/" docBase="D:/server/blog.war" debug="5" reloadable="true" crossContext="true"/>

For JBoss:
contextPath is configured in the application's jboss-web.xml.
<jboss-web>
<context-root>/</context-root>
</jboss-web>

2. servletPath is configured in the application's web.xml.
<servlet-mapping>
<servlet-name>Example</servlet-name>
<url-pattern>/example/*</url-pattern>
</servlet-mapping>

3. Servlet API
We use the following Servlet API methods to read the URL and its parameters:
request.getParameter("name");                // Returns a parameter value from the query string or the POST body; the servlet server has already URL-decoded it.
request.getPathInfo();                       // Note: the string returned as pathInfo has already been URL-decoded by the servlet server.
String requestUri = request.getRequestURI(); // Returns contextPath/servletPath/pathInfo exactly as the browser submitted it; the servlet server has not URL-decoded it.

4. Points of the servlet specification that developers must be clear about:
(1) HttpServletRequest.setCharacterEncoding() only sets the encoding of the request body submitted by POST; it does not set the encoding of the query string submitted by GET. The method tells the application server which encoding to use when parsing the POSTed content. Many articles fail to point this out.
(2) The result returned by HttpServletRequest.getPathInfo() has been decoded by the servlet server.
(3) The string returned by HttpServletRequest.getRequestURI() has not been decoded by the servlet server.
(4) Data submitted by POST is part of the request body.
(5) The Content-Type HTTP header of the Web page ("text/html; charset=GBK") does two things:
   (a) It tells the browser which encoding the Web page data uses.
   (b) When a form is submitted, the browser usually encodes the form data with the charset specified by Content-Type and then sends it to the server.
   Note that Content-Type here means the HTTP header, not the Content-Type in a meta tag in the page.
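
The difference between the undecoded getRequestURI() and the decoded getPathInfo() can be seen without a servlet container. The sketch below is only an analogy: it uses the JDK's java.net.URI, whose getRawPath()/getPath() pair behaves in a comparable raw/decoded way, and the request URL is a made-up example. One difference to keep in mind is that java.net.URI always decodes escaped bytes as UTF-8, whereas a servlet server uses whatever character set it was configured with.

```java
import java.net.URI;

class RawVersusDecodedDemo {
    // Hypothetical request URL; path and query carry the two-character Chinese
    // word for China, percent-encoded as UTF-8.
    static final URI REQUEST = URI.create(
        "http://localhost:8080/example/%E4%B8%AD%E5%9B%BD?name=%E4%B8%AD%E5%9B%BD");

    public static void main(String[] args) {
        // Raw form: the bytes as sent, comparable to getRequestURI().
        System.out.println(REQUEST.getRawPath());
        // Decoded form: comparable to getPathInfo(), already decoded for the caller.
        System.out.println(REQUEST.getPath());
        // The same raw/decoded split applies to the query string.
        System.out.println(REQUEST.getRawQuery());
        System.out.println(REQUEST.getQuery());
    }
}
```

If the program needs to know which encoding the browser actually used, the raw form is the only place where that information is still intact.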

Third, we look at the browser and the application server in turn:
URL: http://localhost:8080/example/中国?name=中国
Characters   Encoding     Bytes (hex)                      Bytes (signed decimal)
中国         UTF-8        0xe4 0xb8 0xad 0xe5 0x9b 0xbd    [-28, -72, -83, -27, -101, -67]
中国         GBK          0xd6 0xd0 0xb9 0xfa              [-42, -48, -71, -6]
中国         ISO-8859-1   0x3f 0x3f                        [63, 63] (information lost)
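
The byte values in this table can be reproduced with a few lines of Java. This is a minimal sketch, assuming a JDK that ships the GBK charset (standard full JDK builds do); the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class EncodedBytesDemo {
    // The two-character Chinese word for China, written with Unicode escapes so
    // the source compiles the same way regardless of the file's own encoding.
    static final String TEXT = "\u4e2d\u56fd";

    // Encodes TEXT in the named charset and lists the resulting signed bytes.
    static String bytesOf(String charsetName) {
        Charset charset = Charset.forName(charsetName);
        return Arrays.toString(TEXT.getBytes(charset));
    }

    public static void main(String[] args) {
        System.out.println("UTF-8      " + bytesOf("UTF-8"));      // [-28, -72, -83, -27, -101, -67]
        System.out.println("GBK        " + bytesOf("GBK"));        // [-42, -48, -71, -6]
        // ISO-8859-1 cannot represent these characters, so each becomes '?' (63).
        System.out.println("ISO-8859-1 " + bytesOf("ISO-8859-1")); // [63, 63]
    }
}
```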

(i) Browser
1. With GET, the browser URL-encodes the URL and then sends it to the server.
(1) For Chinese IE, if the advanced option to always send URLs as UTF-8 is checked (the default), pathInfo is URL-encoded as UTF-8 and queryString is URL-encoded as GBK.
http://localhost:8080/example/中国?name=中国
What is actually sent:
GET /example/%E4%B8%AD%E5%9B%BD?name=%D6%D0%B9%FA

(2) For Chinese IE, if the advanced option to always send URLs as UTF-8 is unchecked, pathInfo and queryString are both URL-encoded as GBK.
What is actually sent:
GET /example/%D6%D0%B9%FA?name=%D6%D0%B9%FA

(3) For Chinese Firefox, pathInfo and queryString are both URL-encoded as GBK.
What is actually sent:
GET /example/%D6%D0%B9%FA?name=%D6%D0%B9%FA

Clearly, different browsers, and different settings of the same browser, affect how pathInfo is encoded in the final URL. Chinese IE and Chinese Firefox both encode queryString as GBK.

Summary and solutions:
1. If the URL contains non-ASCII characters such as Chinese, the browser URL-encodes them. To keep the browser from choosing an encoding we do not want, it is best not to put non-ASCII characters directly in the URL, but to use the URL-encoded string (the % form) instead.
For example:
URL: http://localhost:8080/example/中国?name=中国
Recommended:
URL: http://localhost:8080/example/%D6%D0%B9%FA?name=%D6%D0%B9%FA

2. We recommend that pathInfo and queryString in a URL use the same encoding, which makes server-side processing simpler.

3. One more problem: I have found that many programmers do not realize that URL encoding requires a character set to be specified. Anyone unsure about this can read this document: http://gceclub.sun.com.cn/Java_Docs/html/zh_CN/api/java/net/URLEncoder.html
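
To make the role of the character set concrete, here is a small sketch using java.net.URLEncoder from the JDK. The same two characters produce different percent-encoded strings depending on the charset passed in. The class and helper names are illustrative only, and the sketch assumes the GBK charset is available in the running JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodeCharsetDemo {
    // Percent-encodes a value using the named character set.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u56fd"; // the two-character Chinese word for China
        System.out.println(encode(text, "UTF-8")); // %E4%B8%AD%E5%9B%BD
        System.out.println(encode(text, "GBK"));   // %D6%D0%B9%FA
        // Building the link in code fixes the encoding instead of leaving the
        // choice to the browser.
        System.out.println("http://localhost:8080/example/" + encode(text, "UTF-8")
            + "?name=" + encode(text, "UTF-8"));
    }
}
```

Note that URLEncoder implements form encoding, so it turns a space into "+"; for path segments that may contain spaces, a different encoder would be needed.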

2. POST submission
With POST, the name/value pairs in the form are sent to the server in the request body. The browser encodes the form data according to the page's Content-Type ("text/html; charset=GBK") and then sends it to the server.
In the server-side program we can set the encoding with request.setCharacterEncoding(), before reading any parameter, and then get the correct data through request.getParameter().
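
The POST round trip can be simulated with the JDK alone. The sketch below is not servlet code: encodeField stands in for the browser encoding a form field with the page's charset, and decodeField stands in for the server decoding it with the charset the application declared. Both names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class PostBodyCharsetDemo {
    // Browser side: encode one form field using the page's charset.
    static String encodeField(String name, String value, String pageCharset) {
        try {
            return name + "=" + URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Server side: decode the field value with the charset the application declared.
    static String decodeField(String body, String declaredCharset) {
        String rawValue = body.substring(body.indexOf('=') + 1);
        try {
            return URLDecoder.decode(rawValue, declaredCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String body = encodeField("name", "\u4e2d\u56fd", "GBK");
        System.out.println(body);                       // name=%D6%D0%B9%FA
        System.out.println(decodeField(body, "GBK"));   // matches what the user typed
        System.out.println(decodeField(body, "UTF-8")); // garbled: the charsets disagree
    }
}
```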

Solution:
1. The simplest and cheapest approach is to use one uniform encoding for the URLs and for the pages.
If we do not use a uniform encoding, the program has to transcode. That is why so much material on the Web deals with garbled characters, and many of those solutions are only stopgaps that do not solve the root problem.
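
For reference, the kind of in-program transcoding mentioned above usually looks like the sketch below. It assumes the server decoded the bytes as ISO-8859-1 while the browser actually sent GBK; both charsets are assumptions chosen for illustration, and the helper name is made up. The trick only works because ISO-8859-1 maps every byte to exactly one character, so the original bytes can be recovered; with most other wrong charsets the data is already lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeWorkaroundDemo {
    // Recovers the original bytes from a wrongly decoded string and decodes
    // them again with the charset that was really used.
    static String redecode(String garbled, Charset wronglyUsed, Charset actual) {
        byte[] originalBytes = garbled.getBytes(wronglyUsed);
        return new String(originalBytes, actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Bytes as the browser sent them (GBK)...
        byte[] sent = "\u4e2d\u56fd".getBytes(gbk);
        // ...decoded by a server that assumed ISO-8859-1.
        String garbled = new String(sent, StandardCharsets.ISO_8859_1);
        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1, gbk);
        System.out.println(repaired.equals("\u4e2d\u56fd")); // true
    }
}
```

As the article says, this is a stopgap: it hard-codes a guess about what the browser and the server each did.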

(ii) Servlet server
When the servlet implementation provided by the servlet server encounters a %-encoded string in the URL or in POSTed data, it decodes it according to the specified character set. The results returned by the following two servlet methods have already been decoded:
request.getParameter("name");
request.getPathInfo();

The "specified character set" mentioned here is configured in the application server's configuration file.

(1) Tomcat server
For the Tomcat server, the file is server.xml.
<Connector port="8080" protocol="HTTP/1.1"
maxThreads="..." connectionTimeout="20000"
redirectPort="8443" URIEncoding="GBK"/>
URIEncoding tells the servlet server which encoding to use when decoding the URL.

<Connector port="8080" ... useBodyEncodingForURI="true"/>
useBodyEncodingForURI tells the server to decode the URL's query string with the encoding specified for the request body.

(2) WebLogic server
For the WebLogic server, the file is weblogic.xml.
<input-charset>
<java-charset-name>GBK</java-charset-name>
</input-charset>
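
Whichever file holds the setting, the server ends up applying one character set to the URL. The sketch below simulates that with the JDK's URLDecoder, using the request the article describes Chinese IE sending by default (path as UTF-8, query string as GBK). It is not container code; the constants and the decode helper are illustrative.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class SingleUriEncodingDemo {
    // Path encoded as UTF-8 and query value encoded as GBK, as sent by a
    // browser that mixes the two encodings in one request.
    static final String RAW_PATH = "/example/%E4%B8%AD%E5%9B%BD";
    static final String RAW_QUERY_VALUE = "%D6%D0%B9%FA";

    // Decodes a percent-encoded string with the server's configured charset.
    static String decode(String raw, String configuredCharset) {
        try {
            return URLDecoder.decode(raw, configuredCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        for (String configured : new String[] {"UTF-8", "GBK"}) {
            System.out.println("configured charset = " + configured);
            System.out.println("  path:  " + decode(RAW_PATH, configured));
            System.out.println("  query: " + decode(RAW_QUERY_VALUE, configured));
        }
    }
}
```

With either setting, only one of the two parts comes out right, which is why keeping pathInfo and queryString in the same encoding matters.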

(iii) Browser display
The browser decodes the byte stream sent by the server using the character set specified by Content-Type in the HTTP header ("text/html; charset=GBK"). We can call HttpServletResponse.setContentType() to set the Content-Type HTTP header.
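
What the browser does with that header can be sketched roughly as follows. This is a simplified illustration, not how any real browser is implemented: charsetOf is a hypothetical helper that pulls the charset parameter out of a Content-Type value, and the fallback charset is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharsetDemo {
    // Extracts the charset parameter from a Content-Type value, or returns the
    // fallback when the header carries none.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String trimmed = part.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = trimmed.substring("charset=".length()).replace("\"", "").trim();
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String header = "text/html; charset=GBK";
        // Server side: the page is written out as GBK bytes.
        byte[] body = "<p>\u4e2d\u56fd</p>".getBytes(Charset.forName("GBK"));
        // Browser side: decode the bytes with the charset named in the header.
        String page = new String(body, charsetOf(header, StandardCharsets.ISO_8859_1));
        System.out.println(page.equals("<p>\u4e2d\u56fd</p>")); // true
    }
}
```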

Summary:
1. How pathInfo and queryString in a URL are encoded and decoded is determined by the browser and by the application server configuration; our program cannot set it. Do not expect request.setCharacterEncoding() to set the character set used to decode parameter values in the URL.
Therefore we recommend not using non-ASCII characters such as Chinese in URLs; if a URL must contain them, URL-encode them. For example, instead of:
http://localhost:8080/example1/example/中国
The correct form is:
http://localhost:8080/example1/example/%E4%B8%AD%E5%9B%BD
Likewise, we recommend not using non-ASCII characters in both pathInfo and queryString of a URL, such as:
http://localhost:8080/example1/example/中国?name=中国
The reason is simple: different browsers use different character sets to encode pathInfo and queryString, but the application server usually decodes the whole URL with a single character set.

2. We recommend that the character set used to URL-encode URLs be the same as the character set in the page's Content-Type, so that the program stays simple and needs no complex encoding conversion.
