Copyright notice: if you reprint this article, please cite the source: http://blog.csdn.net/yzhz (Yang Zheng)
I. Problems:
Encoding problems are something Java beginners run into constantly in web development, and there are plenty of articles about them on the Internet. Many of those articles, however, do not accurately explain why non-ASCII characters such as Chinese characters in a URL arrive garbled in the server-side program. This article explains in detail the garbled text caused by using non-ASCII characters such as Chinese characters in URLs.
1. Chinese characters in a URL usually appear in one of two places:
(1) A parameter value in the query string, for example: http://search.china.alibaba.com/search/offer_search.htm?keywords=中国
(2) The servlet path, for example: http://search.china.alibaba.com/selloffer/chinese.html (where "chinese" stands for a file name written in Chinese characters)
2. The main causes of garbled text are:
(1) Browser: the client (browser) does not follow the URI encoding specification (http://www.w3.org/International/O-URL-code.html).
(2) Servlet server: the servlet server is not configured correctly.
(3) Developers: they do not understand the servlet specification and what its APIs mean.
II. Basic Knowledge:
1. The steps of an HTTP request:
Browser (IE, Firefox) --[GET/POST]--> servlet server ----------> browser display
  [encode]                [decode to Unicode, then encode the content to display]                [decode]
(1) The browser encodes the URL (and any content submitted by POST) and sends it to the server.
(2) "Servlet server" here really means the servlet implementation that the server provides (its implementation of ServletRequestWrapper); different application servers implement it differently. This implementation decodes the content and converts it to Unicode; after processing is complete, it encodes the result (the web page) and returns it to the browser.
(3) The browser displays the web page according to the specified encoding.
Encoding and decoding a string always involves a character set; the common ones are ISO8859-1, GBK, UTF-8 and Unicode.
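The role of the character set in these steps can be illustrated with a minimal sketch. It only uses the standard library; the class name CharsetRoundTrip and the helper recode are made up for this illustration, and it assumes the Java runtime ships the GBK charset. The Chinese characters are written as Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Encode text to bytes with one charset, then decode those bytes with another.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this runtime
        String text = "\u4e2d\u56fd";         // the two characters 中国 ("China")

        // Same charset on both sides: the text survives.
        System.out.println(recode(text, gbk, gbk).equals(text)); // true
        // Different charsets on the two sides: the text comes back garbled.
        System.out.println(recode(text, gbk, StandardCharsets.ISO_8859_1).equals(text)); // false
    }
}
```

Garbled text in a web application is usually this same mismatch, with the encoding step done by the browser and the decoding step done by the server (or the other way round for the response).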
2. URL composition:
domain:port/contextPath/servletPath/pathInfo?queryString
Note:
1. contextPath is specified in the servlet server's configuration file.
For WebLogic:
contextPath is configured in the application's weblogic.xml.
<context-root>/</context-root>
For Tomcat:
contextPath is configured in server.xml.
<Context path="/" docBase="D:/server/blog.war" debug="5" reloadable="true" crossContext="true"/>
For JBoss:
contextPath is configured in the application's jboss-web.xml.
<jboss-web>
<context-root>/</context-root>
</jboss-web>
2. servletPath is configured in the application's web.xml.
<servlet-mapping>
<servlet-name>example</servlet-name>
<url-pattern>/example/*</url-pattern>
</servlet-mapping>
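As a rough illustration of the URL layout described above, the sketch below splits an already percent-encoded URL with java.net.URI. It is not how a servlet container does its mapping; the class UrlParts and the helper rawPathInfo are hypothetical, and the example assumes a context root of "/" (so the context path contributes nothing) and a servlet mapped to /example/*.

```java
import java.net.URI;

class UrlParts {
    // Returns what follows the servlet path in the raw (still encoded) path,
    // or null when the path does not start with that servlet path.
    static String rawPathInfo(String rawPath, String servletPath) {
        return rawPath.startsWith(servletPath)
                ? rawPath.substring(servletPath.length())
                : null;
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://localhost:8080/example/%D6%D0%B9%FA?name=%D6%D0%B9%FA");

        System.out.println(uri.getHost());     // domain
        System.out.println(uri.getPort());     // port
        System.out.println(uri.getRawPath());  // contextPath + servletPath + pathInfo, still encoded
        System.out.println(uri.getRawQuery()); // queryString, still encoded

        // Assumed servlet path "/example" (from the /example/* mapping).
        System.out.println(rawPathInfo(uri.getRawPath(), "/example"));
    }
}
```

The raw accessors are used on purpose: which character set should turn %D6%D0%B9%FA back into text is exactly the question the rest of the article deals with.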
3. Servlet API
The following servlet API calls obtain values from the URL and its parameters:
request.getParameter("name"); // the query string parameter value (from GET and POST); the value has been URL-decoded by the servlet server
request.getPathInfo(); // note: the string returned as pathInfo has been URL-decoded by the servlet server
requestURI = request.getRequestURI(); // contextPath/servletPath/pathInfo exactly as the browser submitted it; NOT URL-decoded by the servlet server
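The difference between a raw and a decoded value can be simulated outside a servlet container. The sketch below is only an approximation under stated assumptions: it uses java.net.URLDecoder (which implements form decoding and also turns "+" into a space) rather than the container's own decoder, and the class RawVersusDecoded and its decode helper are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class RawVersusDecoded {
    // Percent-decode a string with the given character set.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String rawPathInfo = "/%D6%D0%B9%FA"; // as sent by the browser (GBK bytes)

        // Comparable to what getRequestURI() exposes: still percent-encoded.
        System.out.println(rawPathInfo);
        // Comparable to what getPathInfo() exposes once the server has decoded with GBK.
        System.out.println(decode(rawPathInfo, "GBK").equals("/\u4e2d\u56fd")); // true
    }
}
```

When the decoded value is garbled, the raw value is still intact, so a program can in principle decode the raw request URI itself with the character set it knows to be right.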
4. What developers must understand about the servlet specification:
(1) HttpServletRequest.setCharacterEncoding() applies only to the request body submitted by POST; it does not set the encoding of a query string submitted by GET. The method tells the application server which encoding to use when parsing the POST body. Many articles fail to mention this.
(2) The result returned by HttpServletRequest.getPathInfo() has already been URL-decoded by the servlet server.
(3) The string returned by HttpServletRequest.getRequestURI() has not been decoded by the servlet server.
(4) Data submitted by POST is part of the request body.
(5) The role of the ContentType ("text/html; charset=GBK") in a web page's HTTP header:
(A) It tells the browser which encoding the page data uses;
(B) When a form is submitted, the browser usually encodes the form data with the charset given in the ContentType and then sends it to the server.
Note that this ContentType is the one in the HTTP header, not the one in a meta tag in the web page.
III. Examples for browsers and application servers:
URL: http://localhost:8080/example/中国?name=中国
Byte representation of the Chinese characters 中国 ("China"):
Characters  Encoding    Hex bytes                       Signed byte values
中国        UTF-8       0xe4 0xb8 0xad 0xe5 0x9b 0xbd   [-28, -72, -83, -27, -101, -67]
中国        GBK         0xd6 0xd0 0xb9 0xfa             [-42, -48, -71, -6]
中国        ISO8859-1   0x3f 0x3f                       [63, 63] (information lost)
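The byte values in the table above can be reproduced with a few lines of Java. This is a small sketch, with the class name ChinaBytes and the helper bytesOf chosen for the example; it assumes the runtime provides the GBK charset. The characters 中国 ("China") are written as Unicode escapes.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class ChinaBytes {
    // Encode the text with the named charset and render the signed byte values.
    static String bytesOf(String text, String charsetName) {
        return Arrays.toString(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u56fd"; // 中国
        System.out.println(bytesOf(text, "UTF-8"));      // [-28, -72, -83, -27, -101, -67]
        System.out.println(bytesOf(text, "GBK"));        // [-42, -48, -71, -6]
        // ISO-8859-1 cannot represent these characters, so each becomes '?' (63).
        System.out.println(bytesOf(text, "ISO-8859-1")); // [63, 63]
    }
}
```

The last line shows why the ISO8859-1 row is marked as lossy: once the characters have been replaced by question marks, no later conversion can bring them back.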
(1) Browser
1. GET submission: the browser URL-encodes the URL and then sends it to the server.
(1) For Chinese-language IE with "Always send URLs as UTF-8" selected in the advanced options (the default), pathInfo is URL-encoded using UTF-8 and the query string using GBK.
http://localhost:8080/example/中国?name=中国
What is actually submitted:
GET /example/%E4%B8%AD%E5%9B%BD?name=%D6%D0%B9%FA
(2) For Chinese-language IE with "Always send URLs as UTF-8" cleared in the advanced options, both pathInfo and the query string are URL-encoded using GBK.
What is actually submitted:
GET /example/%D6%D0%B9%FA?name=%D6%D0%B9%FA
(3) For Firefox, both pathInfo and the query string are URL-encoded using GBK.
What is actually submitted:
GET /example/%D6%D0%B9%FA?name=%D6%D0%B9%FA
Clearly, different browsers, and different settings of the same browser, affect how pathInfo is encoded in the final URL. For the query string, both Chinese-language IE and Firefox use GBK.
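The request lines listed above can be reproduced with java.net.URLEncoder by choosing the character set for the path and for the query string separately. This is only a sketch of the resulting bytes, not of how any browser is implemented: URLEncoder performs form encoding (a space becomes "+"), and the class BrowserRequestLine and its methods are hypothetical names. It assumes the runtime provides GBK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class BrowserRequestLine {
    // Percent-encode text using the named character set.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Build the request line for /example/<segment>?name=<value>,
    // encoding the path segment and the query value independently.
    static String requestLine(String segment, String pathCharset,
                              String value, String queryCharset) {
        return "GET /example/" + encode(segment, pathCharset)
                + "?name=" + encode(value, queryCharset);
    }

    public static void main(String[] args) {
        String china = "\u4e2d\u56fd"; // 中国
        // Path in UTF-8, query string in GBK (the first IE case).
        System.out.println(requestLine(china, "UTF-8", china, "GBK"));
        // Path and query string both in GBK (the other two cases).
        System.out.println(requestLine(china, "GBK", china, "GBK"));
    }
}
```

Generating links this way on the server side, with one explicitly chosen character set, is one way to avoid leaving the choice to the browser.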
Summary and solution:
1. If the URL contains non-ASCII characters such as Chinese characters, the browser URL-encodes them. To keep the browser from choosing an encoding we do not want, it is best not to put non-ASCII characters directly in the URL but to use the URL-encoded %XX string instead.
For example:
URL: http://localhost:8080/example/中国?name=中国
Suggested:
URL: http://localhost:8080/example/%D6%D0%B9%FA?name=%D6%D0%B9%FA
2. We recommend using the same encoding for pathInfo and the query string in the URL, which makes server-side processing simpler.
3. Another problem is that many programmers do not realize that URL encoding requires a character set. Anyone unsure about this can read: http://gceclub.sun.com.cn/Java_Docs/html/zh_CN/api/java/net/URLEncoder.html
2. POST submission
With POST, the name/value pairs of the form are sent to the server in the request body. The browser URL-encodes them according to the ContentType ("text/html; charset=GBK") of the web page and then sends them to the server.
In the server-side program we can set the encoding with request.setCharacterEncoding() and then obtain the correct data with request.getParameter().
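To make the effect of the body encoding concrete, here is a minimal sketch of parsing a URL-encoded form body with an explicit character set. It stands in for what a servlet container does after request.setCharacterEncoding() has been called; it is not the container's code. The class FormBodyParser, its methods and the sample body are assumptions made for the illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodyParser {
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Split "a=1&b=2" into pairs and percent-decode names and values
    // with the character set the page (and therefore the browser) used.
    static Map<String, String> parse(String body, String charsetName) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            String[] kv = pair.split("=", 2);
            String value = kv.length > 1 ? decode(kv[1], charsetName) : "";
            params.put(decode(kv[0], charsetName), value);
        }
        return params;
    }

    public static void main(String[] args) {
        String body = "name=%D6%D0%B9%FA&lang=zh"; // submitted from a GBK page
        System.out.println(parse(body, "GBK").get("name").equals("\u4e2d\u56fd"));   // true
        System.out.println(parse(body, "UTF-8").get("name").equals("\u4e2d\u56fd")); // false
    }
}
```

The second call shows the failure mode: the body is fine, but decoding it with a character set other than the one the browser used produces garbled text.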
Solution:
1. For simplicity and lowest cost, it is best to use one encoding for both URLs and web pages.
If a unified encoding is not used, the program has to do encoding conversion itself. That is why there is so much material on the Internet about handling garbled text; many of those solutions are only stopgaps that treat the symptom and do not solve the problem at its root.
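The kind of in-program conversion referred to here usually looks like the sketch below: the garbled string is turned back into bytes with the character set the server wrongly used, and those bytes are decoded again with the character set the browser actually used. This is shown as an example of the workaround, not as a recommendation. The class MojibakeRepair and the method repair are names invented for the example, and the trick is only lossless when the wrong character set maps every byte to a character, as ISO-8859-1 does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Undo a wrong decode: recover the original bytes, then decode them correctly.
    static String repair(String garbled, Charset decodedAs, Charset actual) {
        return new String(garbled.getBytes(decodedAs), actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available
        String original = "\u4e2d\u56fd";     // 中国

        // Simulate a server that decoded GBK bytes as ISO-8859-1.
        String garbled = new String(original.getBytes(gbk), StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(original)); // false

        String repaired = repair(garbled, StandardCharsets.ISO_8859_1, gbk);
        System.out.println(repaired.equals(original)); // true
    }
}
```

Such code has to know both character sets in advance, which is why it breaks as soon as a browser or server setting changes.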
(2) Servlet server
When the servlet implementation provided by the servlet server encounters %-escaped strings in the URL or in POST data, it decodes them using a specified character set. The results returned by the following two methods have been decoded:
request.getParameter("name");
request.getPathInfo();
The "specified character set" is configured in the application server's configuration file.
(1) Tomcat server
For Tomcat, the file is server.xml
<Connector port="8080" protocol="HTTP/1.1"
maxThreads="150" connectionTimeout="20000"
redirectPort="8443" URIEncoding="GBK"/>
URIEncoding tells the server's servlet implementation which encoding to use for URL decoding.
<Connector port="8080" ... useBodyEncodingForURI="true"/>
useBodyEncodingForURI tells the server to decode the URL using the encoding specified for the request body (in Tomcat this applies to the query string parameters).
(2) WebLogic Server
For WebLogic Server, the file is weblogic.xml
<input-charset>
<java-charset-name>GBK</java-charset-name>
</input-charset>
(3) Browser display
The browser decodes the byte stream sent by the server using the character set given in the ContentType ("text/html; charset=GBK") of the HTTP header. We can set that header by calling HttpServletResponse.setContentType().
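What the browser does with the header can be approximated as follows. The sketch extracts the charset parameter from a ContentType value and uses it to decode the response bytes. It is deliberately simplified: the class ContentTypeCharset and its methods are hypothetical, quoted parameter values are not handled, and falling back to ISO-8859-1 when no charset is given is an assumption of this example rather than a description of any particular browser.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class ContentTypeCharset {
    // Return the value of the charset parameter, or the fallback if there is none.
    static String charsetOf(String contentType, String fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return p.substring("charset=".length()).trim();
            }
        }
        return fallback;
    }

    // Decode the response body with the charset announced in the header.
    static String decodeBody(byte[] body, String contentType) {
        String name = charsetOf(contentType, "ISO-8859-1"); // assumed fallback
        return new String(body, Charset.forName(name));
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xd6, (byte) 0xd0, (byte) 0xb9, (byte) 0xfa}; // 中国 in GBK
        System.out.println(charsetOf("text/html; charset=GBK", "ISO-8859-1")); // GBK
        System.out.println(decodeBody(body, "text/html; charset=GBK").equals("\u4e2d\u56fd")); // true
    }
}
```

If the header named a character set other than the one the page bytes were written in, the same bytes would decode to garbled text on screen.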
Summary:
1. How pathInfo and the query string in a URL are encoded and decoded is determined by the configuration of the browser and the application server; our program cannot set it. Do not expect request.setCharacterEncoding() to set the character set used to decode URL parameter values.
We therefore recommend not using non-ASCII characters such as Chinese characters in URLs. If a URL must contain them, write them URL-encoded. For example:
http://localhost:8080/example1/example/中国
Correct form:
http://localhost:8080/example1/example/%E4%B8%AD%E5%9B%BD
We also recommend not using non-ASCII characters in both pathInfo and the query string of the same URL, for example:
http://localhost:8080/example1/example/中国?name=中国
The reason is simple: a browser may use different character sets to encode pathInfo and the query string, but the application server uses a single character set to decode the URL.
2. We recommend that the character set used for URL encoding match the character set in the web page's ContentType, so that the program stays simple and needs no complicated encoding conversion.
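The mismatch behind the first recommendation can be demonstrated with a short sketch. It takes a request whose path was encoded in UTF-8 and whose query value was encoded in GBK, then decodes both parts with one character set, as a server configured with a single URL encoding would. java.net.URLDecoder is used as a stand-in for the server's decoder, and the class SingleCharsetDecoding is a hypothetical name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class SingleCharsetDecoding {
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String expected = "\u4e2d\u56fd";         // 中国
        String rawPath = "%E4%B8%AD%E5%9B%BD";    // path segment, encoded in UTF-8
        String rawQueryValue = "%D6%D0%B9%FA";    // query value, encoded in GBK

        // Server decodes everything as GBK: the query is right, the path is garbled.
        System.out.println(decode(rawQueryValue, "GBK").equals(expected)); // true
        System.out.println(decode(rawPath, "GBK").equals(expected));       // false

        // Server decodes everything as UTF-8: the path is right, the query is garbled.
        System.out.println(decode(rawPath, "UTF-8").equals(expected));       // true
        System.out.println(decode(rawQueryValue, "UTF-8").equals(expected)); // false
    }
}
```

Whichever single character set the server is configured with, one of the two parts comes out wrong, which is why the URL should use one encoding throughout.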