Let's take a look at three URLs:
url1. http://hi.baidu.com/爱宝的妍
url2. http://hi.baidu.com/%E7%88%B1%E5%AE%9D%E7%9A%84%E5%A6%8D (UTF-8 encoding)
url3. http://hi.baidu.com/%B0%AE%B1%A6%B5%C4%E5%FB (GBK encoding)
These three URLs point to the same web page, and all of them work. In fact, if your browser runs in a Chinese environment and IE is set to "Send URLs as UTF-8", entering url1 in IE makes the server receive url2, while entering url1 in Firefox makes the server receive url3. Why? Read on (the analysis part of this article comes from http://blog.csdn.net/yzhz/archive/2007/07/03/1676796.aspx).
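As a rough illustration of how one name yields two different URLs, the sketch below percent-encodes the last segment of url1 with UTF-8 and with GBK. It uses java.net.URLEncoder (Java 10 or later for the Charset overload), which implements form encoding rather than strict path encoding, so treat it as an approximation of what a browser does; the class and method names are made up for this example.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ThreeUrlsDemo {
    // The four characters of the last segment of url1, written as Unicode
    // escapes so the source file encoding does not matter.
    static final String NAME = "\u7231\u5b9d\u7684\u598d";
    static final String BASE = "http://hi.baidu.com/";

    // Percent-encode the name with the given charset and append it to the base URL.
    static String encodeWith(Charset charset) {
        return BASE + URLEncoder.encode(NAME, charset);
    }

    public static void main(String[] args) {
        System.out.println(encodeWith(StandardCharsets.UTF_8)); // the url2 form
        System.out.println(encodeWith(Charset.forName("GBK"))); // the url3 form
    }
}
```

The same input string produces two different byte sequences, which is why the server has to know which charset the browser used.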
I. Problems:
Encoding problems are something Java beginners often run into during web development, and there are many related articles on the internet. However, many of them do not accurately explain the garbled characters that back-end programs see when a URL contains non-ASCII characters such as Chinese characters. This article describes that problem in detail.
1. Chinese characters in a URL usually appear in one of two places:
(1) A parameter value in the query string, for example: http://search.china.alibaba.com/search/offer_search.htm?keywords=中国
(2) The servlet path, for example: http://search.china.alibaba.com/selloffer/chinese.html, where the file name is written in Chinese characters
2. The main causes of garbled text are:
(1) Browser: the client (browser) itself does not follow the URI encoding specification (http://www.w3.org/International/O-URL-code.html).
(2) Servlet server: the servlet server is not configured correctly.
(3) Developers: developers do not understand the servlet specification and what its APIs mean.
II. Basic Knowledge:
1. Several steps of an HTTP request:
Browser (IE, Firefox) [GET/POST] --(encode)--> servlet server --(decode to Unicode, process, encode)--> browser display (decode)
(1) The browser encodes the URL (and any content submitted by POST) and sends it to the server.
(2) "Servlet server" here really means the servlet request implementation (the class behind ServletRequestWrapper) that the servlet server provides. These implementations differ between application servers. They decode the content and convert it to Unicode; once processing is complete, the result (the web page) is encoded and returned to the browser.
(3) The browser displays the web page according to the specified encoding.
Character sets are involved whenever strings are encoded and decoded; the ones typically used are ISO8859-1, GBK, UTF-8, and Unicode.
2. URL composition:
domain:port/contextPath/servletPath/pathInfo?queryString
Note:
1. contextPath is specified in the servlet server's configuration file.
For WebLogic:
contextPath is configured in the application's weblogic.xml.
<context-root>/</context-root>
For Tomcat:
contextPath is configured in server.xml.
<Context path="/" docBase="D:/Server/blog.war" debug="5" reloadable="true" crossContext="true"/>
For JBoss:
contextPath is configured in the application's jboss-web.xml.
<jboss-web>
  <context-root>/</context-root>
</jboss-web>
2. servletPath is configured in the application's web.xml.
<servlet-mapping>
  <servlet-name>example</servlet-name>
  <url-pattern>/example/*</url-pattern>
</servlet-mapping>
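To make the URL layout concrete, here is a small sketch that splits a request URL into the parts named above using java.net.URI. The deployment is assumed, not taken from a real server: a root context (for which getContextPath() would return an empty string) and a servlet mapped to /example/*. The class name and constants are hypothetical.

```java
import java.net.URI;

class UrlPartsDemo {
    // Hypothetical deployment: root context and a servlet mapped to /example/*.
    static final String CONTEXT_PATH = "";
    static final String SERVLET_PATH = "/example";

    // Returns {contextPath, servletPath, raw pathInfo, raw queryString}.
    // Assumes the URL path really starts with CONTEXT_PATH + SERVLET_PATH.
    static String[] split(String url) {
        URI uri = URI.create(url);
        String rawPath = uri.getRawPath(); // still percent-encoded
        String afterContext = rawPath.substring(CONTEXT_PATH.length());
        String rawPathInfo = afterContext.substring(SERVLET_PATH.length());
        return new String[] { CONTEXT_PATH, SERVLET_PATH, rawPathInfo, uri.getRawQuery() };
    }

    public static void main(String[] args) {
        String[] parts = split("http://localhost:8080/example/%E4%B8%AD%E5%9B%BD?name=%D6%D0%B9%FA");
        System.out.println("contextPath = \"" + parts[0] + "\"");
        System.out.println("servletPath = " + parts[1]);
        System.out.println("pathInfo    = " + parts[2] + " (raw)");
        System.out.println("queryString = " + parts[3] + " (raw)");
    }
}
```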
3. Servlet API
Use the following servlet API calls to obtain URL values and parameters.
request.getParameter("name"); // returns a parameter value from the query string or the POST body; the value has already been URL-decoded by the servlet server
request.getPathInfo(); // note: the returned string has already been URL-decoded by the servlet server
String requestURI = request.getRequestURI(); // contextPath/servletPath/pathInfo exactly as submitted by the browser, not URL-decoded by the servlet server
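The difference between the raw and the decoded values can be imitated with the plain JDK. The sketch below is only a stand-in for what a servlet server does internally: it uses java.net.URLDecoder (which also turns "+" into a space, unlike real path decoding), and the class and method names are invented for this example.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RawVersusDecodedDemo {
    // Stand-in for the decoding a servlet server applies before
    // getPathInfo() or getParameter() returns a value.
    static String decode(String raw, Charset serverCharset) {
        return URLDecoder.decode(raw, serverCharset);
    }

    public static void main(String[] args) {
        String rawPathInfo = "/%E4%B8%AD%E5%9B%BD"; // as it would appear inside getRequestURI()
        String rawValue = "%D6%D0%B9%FA";           // as it would appear in the raw query string

        System.out.println("raw pathInfo:     " + rawPathInfo);
        System.out.println("decoded pathInfo: " + decode(rawPathInfo, StandardCharsets.UTF_8));
        System.out.println("decoded value:    " + decode(rawValue, Charset.forName("GBK")));
    }
}
```

Both decoded values come out as the same two characters only because each one is decoded with the charset that was used to encode it.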
4. What developers must understand about the servlet specification:
(1) HttpServletRequest.setCharacterEncoding() only sets the encoding of the request body submitted by POST; it does not set the encoding of a query string submitted by GET. The method tells the application server which encoding to use when parsing the POSTed content. Many articles do not mention this.
(2) The result returned by HttpServletRequest.getPathInfo() has been decoded by the servlet server.
(3) The string returned by HttpServletRequest.getRequestURI() has not been decoded by the servlet server.
(4) Data submitted by POST is part of the request body.
(5) The role of the Content-Type ("text/html; charset=GBK") in a web page's HTTP header:
(A) It tells the browser which encoding the data on the web page uses.
(B) When a form is submitted, the browser usually encodes the form data according to the charset specified by the Content-Type and then sends it to the server.
Note that Content-Type here means the Content-Type of the HTTP header, not the one in a meta tag on the web page.
III. Examples with browsers and application servers:
URL: http://localhost:8080/example/中国?name=中国 (中国 means "China")
Byte representation of the Chinese characters 中国 ("China"):
Charset     Bytes (hex)                      Bytes (signed Java byte)
UTF-8       0xe4 0xb8 0xad 0xe5 0x9b 0xbd    [-28, -72, -83, -27, -101, -67]
GBK         0xd6 0xd0 0xb9 0xfa              [-42, -48, -71, -6]
ISO8859-1   0x3f 0x3f                        [63, 63] (information loss)
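The byte values above can be reproduced with String.getBytes(). This is a minimal sketch with an invented class name; the two characters are written as Unicode escapes so that the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteTableDemo {
    // The two characters meaning "China", as Unicode escapes.
    static final String ZHONGGUO = "\u4e2d\u56fd";

    // Encode the string with the given charset and format the signed byte values.
    static String bytesOf(Charset charset) {
        return Arrays.toString(ZHONGGUO.getBytes(charset));
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:     " + bytesOf(StandardCharsets.UTF_8));
        System.out.println("GBK:       " + bytesOf(Charset.forName("GBK")));
        // ISO8859-1 cannot represent these characters, so each becomes '?' (63).
        System.out.println("ISO8859-1: " + bytesOf(StandardCharsets.ISO_8859_1));
    }
}
```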
(1) Browser
1. GET submission: when a request is submitted with GET, the browser URL-encodes the URL and then sends it to the server.
(1) In Chinese-language IE, if "Always send URLs as UTF-8" is selected in the advanced options (the default), pathInfo is URL-encoded as UTF-8 and queryString is URL-encoded as GBK.
http://localhost:8080/example/中国?name=中国
What is actually submitted:
GET /example/%E4%B8%AD%E5%9B%BD?name=%D6%D0%B9%FA
(2) In Chinese-language IE, if "Always send URLs as UTF-8" is deselected in the advanced options, both pathInfo and queryString are URL-encoded as GBK.
What is actually submitted:
GET /example/%D6%D0%B9%FA?name=%D6%D0%B9%FA
(3) In Firefox, both pathInfo and queryString are URL-encoded as GBK.
What is actually submitted:
GET /example/%D6%D0%B9%FA?name=%D6%D0%B9%FA
Clearly, different browsers, and different settings of the same browser, affect how pathInfo is encoded in the final URL. For queryString, both Chinese-language IE and Firefox use GBK.
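The three cases above differ only in which charset is applied to the path and which to the query string. The following sketch rebuilds the request lines under that assumption; it uses java.net.URLEncoder as an approximation of browser behavior, and the class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BrowserRequestLineDemo {
    // The two characters meaning "China", as Unicode escapes.
    static final String ZHONGGUO = "\u4e2d\u56fd";

    // Build the request line a browser would send if it encoded the path
    // and the query string with the given charsets.
    static String requestLine(Charset pathCharset, Charset queryCharset) {
        return "GET /example/" + URLEncoder.encode(ZHONGGUO, pathCharset)
                + "?name=" + URLEncoder.encode(ZHONGGUO, queryCharset);
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        Charset gbk = Charset.forName("GBK");
        System.out.println(requestLine(utf8, gbk)); // IE, "send as UTF-8" enabled
        System.out.println(requestLine(gbk, gbk));  // IE with the option off, or Firefox
    }
}
```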
Summary and solution:
1. If a URL contains non-ASCII characters such as Chinese characters, the browser URL-encodes them. To keep the browser from choosing an encoding we do not want, it is best not to put non-ASCII characters directly in the URL, but to use the URL-encoded (%XX) form instead.
For example:
URL: http://localhost:8080/example/中国?name=中国
Suggested form:
URL: http://localhost:8080/example/%D6%D0%B9%FA?name=%D6%D0%B9%FA
2. We recommend that pathInfo and queryString in a URL use the same encoding, which makes processing on the server easier.
3. Another problem is that many programmers do not realize that URL encoding requires a character set. If this is unfamiliar, see this document: http://gceclub.sun.com.cn/Java_Docs/html/zh_CN/api/java/net/URLEncoder.html
2. POST submission
With POST, the name/value pairs of the form are sent to the server in the request body. In this case the browser URL-encodes them according to the charset in the web page's Content-Type ("text/html; charset=GBK") and then sends them to the server.
In the server-side program, we can set the encoding with request.setCharacterEncoding() and then get the correct data with request.getParameter().
Solution:
1. For simplicity and the lowest cost, it is best to use one uniform encoding for both URLs and web pages.
If a uniform encoding is not used, the program has to perform encoding conversions itself. This is why there is so much material on the internet about handling garbled text. Many of those solutions are only stopgaps and do not fix the problem at its root.
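One encoding conversion commonly found in such articles is sketched below, purely to illustrate why it counts as a stopgap. It assumes the server decoded GBK bytes as ISO8859-1 (the charset a servlet container falls back to when none is specified) and then reverses that step by hand. The class and method names are invented, and java.net.URLDecoder stands in for the server's decoding.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RecodeWorkaroundDemo {
    // What the program receives if the server decodes the percent-escapes
    // as ISO8859-1 although the browser encoded them as GBK.
    static String decodedWrongly(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.ISO_8859_1);
    }

    // ISO8859-1 maps every byte to exactly one character, so turning the
    // garbled string back into bytes restores the original bytes, which can
    // then be decoded with the charset the browser really used.
    static String recover(String garbled, Charset actual) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String garbled = decodedWrongly("%D6%D0%B9%FA");
        System.out.println("garbled:   " + garbled);
        System.out.println("recovered: " + recover(garbled, Charset.forName("GBK")));
    }
}
```

The trick only works when the wrong charset is known and lossless, and it must be repeated for every parameter, which is why configuring one uniform encoding is the better fix.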
(2) Servlet server
When the servlet implementation provided by the servlet server encounters %XX sequences in the URL or in POST data, it decodes them according to a specified character set. The results returned by the following two servlet methods have already been decoded:
request.getParameter("name");
request.getPathInfo();
This "specified character set" is configured in the application server's configuration file.
(1) Tomcat server
For Tomcat, the file is server.xml.
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="150" connectionTimeout="20000"
           redirectPort="8443" URIEncoding="GBK"/>
URIEncoding tells the server which encoding to use for URL decoding.
<Connector port="8080" ... useBodyEncodingForURI="true"/>
useBodyEncodingForURI tells the server to decode the query string with the encoding specified for the request body instead of URIEncoding.
(2) WebLogic Server
For WebLogic Server, the file is weblogic.xml.
<input-charset>
  <java-charset-name>GBK</java-charset-name>
</input-charset>
(3) Browser display
The browser decodes the byte stream sent by the server using the character set given in the Content-Type ("text/html; charset=GBK") of the HTTP header. We can call HttpServletResponse.setContentType() to set the Content-Type of the HTTP header.
Summary:
1. How pathInfo and queryString in a URL are encoded and decoded is determined by the configuration of the browser and of the application server. Our program cannot set this, and we should not expect request.setCharacterEncoding() to set the character set used for decoding URL parameter values.
Therefore, we recommend not using non-ASCII characters such as Chinese characters in a URL. If a URL must contain non-ASCII characters, write them in URL-encoded form. For example, instead of:
http://localhost:8080/example1/example/中国
the correct form is:
http://localhost:8080/example1/example/%E4%B8%AD%E5%9B%BD
We also recommend not using non-ASCII characters in both pathInfo and queryString of the same URL, for example:
http://localhost:8080/example1/example/中国?name=中国
The reason is simple: browsers may use different character sets for pathInfo and queryString when encoding a URL, but the application server uses a single character set for URL decoding.
2. We recommend that the character set used to URL-encode a URL be the same as the character set in the web page's Content-Type. The program then stays simple and needs no complicated encoding conversion.
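The reasoning in point 1 can be checked with a short sketch: take a request whose pathInfo was encoded as UTF-8 and whose queryString value was encoded as GBK, and try to decode both with one server-side charset. The class name, constants, and the use of java.net.URLDecoder as a stand-in for the server are assumptions made for this example.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SingleCharsetDecodeDemo {
    static final String RAW_PATH_INFO = "%E4%B8%AD%E5%9B%BD"; // encoded as UTF-8
    static final String RAW_QUERY_VALUE = "%D6%D0%B9%FA";     // encoded as GBK
    static final String EXPECTED = "\u4e2d\u56fd";            // the intended two characters

    // True only if a server using this single charset decodes both parts correctly.
    static boolean bothCorrect(Charset serverCharset) {
        return URLDecoder.decode(RAW_PATH_INFO, serverCharset).equals(EXPECTED)
                && URLDecoder.decode(RAW_QUERY_VALUE, serverCharset).equals(EXPECTED);
    }

    public static void main(String[] args) {
        System.out.println("server uses UTF-8: " + bothCorrect(StandardCharsets.UTF_8));
        System.out.println("server uses GBK:   " + bothCorrect(Charset.forName("GBK")));
    }
}
```

Whichever single charset the server is configured with, one of the two parts comes out garbled, which is the case the recommendation is meant to avoid.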
Now you should understand how URL encoding works. Look again at the three URLs from the beginning of this article. The last segment of url1 belongs to pathInfo. Therefore, with the default settings of IE and Firefox, IE encodes the URL as UTF-8 and Firefox encodes it as GBK, which gives url2 and url3 respectively.