JSP encoding problems


Article Source: http://www.blogjava.net/freeman1984/archive/2010/06/01/322465.html

Source: http://blog.csdn.net/yzhz

I. The problem:
Encoding problems are something Java beginners often run into during web development. There are many related articles on the Internet, but few of them accurately explain why the backend program sees garbled text when a URL contains non-ASCII characters such as Chinese characters. This article describes that case in detail.

1. Chinese characters in a URL usually appear in one of two places:
(1) A parameter value in the query string, for example: http://search.china.alibaba.com/search/offer_search.htm?keywords=中国
(2) The servlet path, for example: http://search.china.alibaba.com/selloffer/chinese.html (where "chinese" stands for a file name written in Chinese characters)


2. The main causes of garbled text are:
(1) Browser: the client (browser) does not follow the URI encoding specification (http://www.w3.org/International/O-URL-code.html).
(2) Servlet server: the servlet server is not configured correctly.
(3) Developer: the developer does not understand the servlet specification and what its APIs mean.

II. Basic knowledge:
1. The steps of an HTTP request:
Browser (IE, Firefox) [GET/POST] --(encode)--> servlet server --(decode to Unicode, then encode the response)--> browser display
(1) The browser encodes the URL (and the content submitted by POST) and sends it to the server.
(2) "Servlet server" here really means the servlet implementation supplied by the server (its implementation of ServletRequestWrapper); each application server has its own. The implementation decodes the request content and converts it to Unicode, and after processing is complete it encodes the result (the web page) and returns it to the browser.
(3) The browser displays the web page according to the specified encoding.

Encoding and decoding strings always involves a character set. The ones commonly used are ISO8859-1, GBK, UTF-8 and Unicode.


2. URL composition:
domain:port/contextPath/servletPath/pathInfo?queryString
Notes:

1. contextPath is specified in the servlet server's configuration file.
For WebLogic:
contextPath is configured in the application's weblogic.xml.
<context-root>/</context-root>

For Tomcat:
contextPath is configured in server.xml.
<Context path="/" docBase="D:/Server/blog.war" debug="5" reloadable="true" crossContext="true"/>

For JBoss:
contextPath is configured in the application's jboss-web.xml.
<jboss-web>
<context-root>/</context-root>
</jboss-web>

2. servletPath is configured in the application's web.xml.
<servlet-mapping>
<servlet-name>example</servlet-name>
<url-pattern>/example/*</url-pattern>
</servlet-mapping>

3. Servlet API
The following servlet API calls return the URL's values and parameters.
request.getParameter("name"); // returns a parameter value from the query string (GET) or the request body (POST); the value has been URL-decoded by the servlet server
request.getPathInfo(); // note: the returned string has been URL-decoded by the servlet server
String requestURI = request.getRequestURI(); // returns contextPath/servletPath/pathInfo as the raw data submitted by the browser; it has not been URL-decoded by the servlet server
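The difference between a raw and a decoded URL component can be tried outside a servlet container. The sketch below uses java.net.URI from the JDK as a stand-in; it is not the servlet API, and unlike a servlet server it always decodes as UTF-8. The class name RawVersusDecodedUrl and the sample URL are assumptions made for illustration.

```java
import java.net.URI;

class RawVersusDecodedUrl {
    // Path exactly as it travels on the wire, still percent-encoded.
    // This is the analogue of what request.getRequestURI() hands back.
    static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    // Path after percent-decoding. java.net.URI decodes escaped octets as UTF-8,
    // whereas a servlet server uses whatever character set it is configured with.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        // The escaped bytes are the UTF-8 encoding of the two Chinese characters for "China".
        String url = "http://localhost:8080/example/%E4%B8%AD%E5%9B%BD?name=%E4%B8%AD%E5%9B%BD";
        System.out.println("raw:     " + rawPath(url));
        System.out.println("decoded: " + decodedPath(url));
    }
}
```

The raw form is safe to log or compare byte for byte; the decoded form is only as trustworthy as the character set used to decode it.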


4. Points of the servlet specification that developers must understand:
(1) HttpServletRequest.setCharacterEncoding() only sets the encoding of the request body submitted by POST. It does not set the encoding of the query string submitted by GET. The method tells the application server which encoding to use when parsing the POSTed content. Many articles do not mention this.
(2) The result returned by HttpServletRequest.getPathInfo() has been URL-decoded by the servlet server.
(3) The string returned by HttpServletRequest.getRequestURI() has not been URL-decoded by the servlet server.
(4) Data submitted by POST is part of the request body.
(5) The Content-Type in the HTTP header of a web page, for example contentType ("text/html; charset=GBK"), has two roles:
(a) It tells the browser which encoding the data on the page is in.
(b) When a form is submitted, the browser usually encodes the form data with the charset specified by that Content-Type and then sends it to the server.
Note that this means the Content-Type of the HTTP header, not the Content-Type in a meta tag on the page.


III. Examples of browsers and application servers:
URL: http://localhost:8080/example/中国?name=中国
Byte representation of the Chinese characters 中国 ("China"):
Character set   Bytes (hex)                      Bytes (signed decimal)         Note
UTF-8           0xe4 0xb8 0xad 0xe5 0x9b 0xbd    [-28,-72,-83,-27,-101,-67]
GBK             0xd6 0xd0 0xb9 0xfa              [-42,-48,-71,-6]
ISO8859-1       0x3f 0x3f                        [63,63]                        information loss
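The byte values above can be reproduced with String.getBytes. This is a small sketch, assuming a JDK that ships the GBK charset (standard full JDK builds do); the class name ChineseBytes is made up for the example.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class ChineseBytes {
    // The two Chinese characters for "China", written as Unicode escapes so the
    // example does not depend on the encoding of this source file.
    static final String TEXT = "\u4e2d\u56fd";

    // Encodes the text with the named charset and formats the signed byte values.
    static String signedBytes(String text, String charsetName) {
        return Arrays.toString(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "GBK", "ISO-8859-1"}) {
            System.out.println(name + " " + signedBytes(TEXT, name));
        }
        // ISO-8859-1 cannot represent these characters, so each one becomes
        // the replacement byte 63 ('?') and the original text is lost.
    }
}
```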


(1) Browser
1. GET submission: the browser URL-encodes the URL and then sends it to the server.
(1) For Chinese IE with the advanced option to always send URLs as UTF-8 selected (the default), pathInfo is URL-encoded as UTF-8 and queryString is URL-encoded as GBK.
http://localhost:8080/example/中国?name=中国
What is actually submitted:
GET /example/%E4%B8%AD%E5%9B%BD?name=%D6%D0%B9%FA

(2) For Chinese IE with the advanced option to always send URLs as UTF-8 cleared, both pathInfo and queryString are URL-encoded as GBK.
What is actually submitted:
GET /example/%D6%D0%B9%FA?name=%D6%D0%B9%FA

(3) For Firefox, both pathInfo and queryString are URL-encoded as GBK.
What is actually submitted:
GET /example/%D6%D0%B9%FA?name=%D6%D0%B9%FA

Clearly, different browsers, and different settings of the same browser, affect how pathInfo is encoded in the final URL. For queryString, Chinese IE and Firefox both use GBK.

Summary and solution:
1. If the URL contains non-ASCII characters such as Chinese characters, the browser URL-encodes them. To keep the browser from choosing an encoding we do not want, do not put non-ASCII characters directly in the URL; use the URL-encoded (%XX) string instead.
For example, instead of:
URL: http://localhost:8080/example/中国?name=中国
use:
URL: http://localhost:8080/example/%D6%D0%B9%FA?name=%D6%D0%B9%FA

2. We recommend that pathInfo and queryString in the URL use the same encoding, which makes processing on the server easier.

3. Another problem is that many programmers do not realize that URL encoding requires a character set. If this is unclear, see this document: http://gceclub.sun.com.cn/Java_Docs/html/zh_CN/api/java/net/URLEncoder.html
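To make the point about character sets concrete, here is a minimal sketch that builds the percent-encoded URL on the server side with java.net.URLEncoder, passing the character set explicitly. The class name UrlEncodeWithCharset and the helper buildUrl are hypothetical; note also that URLEncoder produces form encoding (a space becomes "+"), which is fine for these sample values but differs slightly from strict path encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodeWithCharset {
    // URL-encodes text with an explicit character set instead of a platform default.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Hypothetical helper: uses one character set for both the path segment and the query value.
    static String buildUrl(String text, String charsetName) {
        String encoded = encode(text, charsetName);
        return "http://localhost:8080/example/" + encoded + "?name=" + encoded;
    }

    public static void main(String[] args) {
        String china = "\u4e2d\u56fd"; // the two Chinese characters for "China"
        System.out.println(encode(china, "UTF-8")); // %E4%B8%AD%E5%9B%BD
        System.out.println(encode(china, "GBK"));   // %D6%D0%B9%FA
        System.out.println(buildUrl(china, "GBK"));
    }
}
```

The same input yields two different percent-encoded strings, which is why the server must know which character set the link was built with.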

2. POST submission
With POST, the parameter name/value pairs of the form are sent to the server in the request body. The browser encodes them according to the page's contentType ("text/html; charset=GBK") and then sends them to the server.
In the server-side program we can set the encoding with request.setCharacterEncoding() and then read the correct data with request.getParameter().

Solution:
1. For simplicity and minimum cost, it is best to use one encoding for both the URL and the web page.
If a single encoding is not used, the program has to do encoding conversion itself. This is why there is so much material on the Internet about handling garbled text. Many of those solutions are only stopgaps and do not fix the root cause.
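The kind of in-program conversion referred to here usually looks like the sketch below. It assumes the server decoded the bytes as ISO-8859-1 (which maps every byte to exactly one character, so nothing is lost) while the browser actually sent GBK. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Simulates a server that decodes the wire bytes with the wrong charset (ISO-8859-1).
    static String misdecode(byte[] wireBytes) {
        return new String(wireBytes, StandardCharsets.ISO_8859_1);
    }

    // Recovers the original bytes from the garbled string, then decodes them correctly.
    // This only works because ISO-8859-1 round-trips every byte value.
    static String repair(String garbled, String actualCharset) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        // GBK bytes for the two Chinese characters for "China".
        byte[] gbkBytes = {(byte) 0xd6, (byte) 0xd0, (byte) 0xb9, (byte) 0xfa};
        String garbled = misdecode(gbkBytes);
        System.out.println("garbled length:  " + garbled.length()); // 4 Latin-1 characters
        System.out.println("repaired equals: " + repair(garbled, "GBK").equals("\u4e2d\u56fd"));
    }
}
```

The trick breaks down as soon as the first, wrong decoding was lossy (for example, GBK bytes decoded as UTF-8 turn into replacement characters), which is one reason it counts as a stopgap rather than a fix.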


(2) Servlet server
When the servlet implementation supplied by the servlet server encounters a string containing % in the URL or in the POSTed data, it decodes it according to the specified character set. The results returned by the following two servlet methods have been decoded:
request.getParameter("name");
request.getPathInfo();

This "specified character set" is configured in the application server's configuration file.

(1) Tomcat server
For the Tomcat server, the file is server.xml
<Connector port="8080" protocol="HTTP/1.1"
maxThreads="150" connectionTimeout="20000"
redirectPort="8443" URIEncoding="GBK"/>
URIEncoding tells the server which encoding to use for URL decoding.

<Connector port="8080" ... useBodyEncodingForURI="true"/>
useBodyEncodingForURI tells the server to decode the URL's query parameters with the encoding specified for the request body.

(2) WebLogic Server
For WebLogic Server, the file is weblogic.xml
<input-charset>
<java-charset-name>GBK</java-charset-name>
</input-charset>
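The effect of a single server-wide decoding charset can be simulated with java.net.URLDecoder. This is only a stand-in for what a servlet server does internally, not Tomcat or WebLogic code, and the names ConfiguredUriDecoding and configuredCharset are assumptions. The sample values are the path and query from the request that IE sends by default: the path encoded as UTF-8, the query value as GBK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class ConfiguredUriDecoding {
    // Percent-decodes a URL component with the one charset the server is configured to use.
    static String decode(String percentEncoded, String configuredCharset) {
        try {
            return URLDecoder.decode(percentEncoded, configuredCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + configuredCharset, e);
        }
    }

    public static void main(String[] args) {
        String expected = "\u4e2d\u56fd";       // the two Chinese characters for "China"
        String utf8Path = "%E4%B8%AD%E5%9B%BD"; // path segment as sent by IE (UTF-8)
        String gbkQuery = "%D6%D0%B9%FA";       // query value as sent by IE (GBK)

        // A server configured for GBK gets the query right and the path wrong.
        System.out.println("query ok with GBK:  " + decode(gbkQuery, "GBK").equals(expected));
        System.out.println("path ok with GBK:   " + decode(utf8Path, "GBK").equals(expected));
        // Switching the server to UTF-8 fixes the path but would break the query instead.
        System.out.println("path ok with UTF-8: " + decode(utf8Path, "UTF-8").equals(expected));
    }
}
```

No single setting decodes both parts of that request correctly, so the path and the query have to be sent in the same encoding in the first place.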

(3) Browser display
The browser decodes the byte stream sent by the server according to the character set specified by the Content-Type in the HTTP header, for example contentType ("text/html; charset=GBK"). We can call HttpServletResponse.setContentType() to set the Content-Type of the HTTP header.

Summary:
1. How pathInfo and queryString in the URL are encoded and decoded is determined by the configuration of the browser and the application server. Our program cannot set it, and we should not expect request.setCharacterEncoding() to set the character set used for decoding URL parameter values.
Therefore, we recommend not using non-ASCII characters such as Chinese characters in the URL. If the URL does contain non-ASCII characters, URL-encode them. For example, instead of:
http://localhost:8080/example1/example/中国
the correct form is:
http://localhost:8080/example1/example/%E4%B8%AD%E5%9B%BD
We also recommend not using non-ASCII characters in both pathInfo and queryString of the same URL, for example:
http://localhost:8080/example1/example/中国?name=中国
The reason is simple: a browser may use different character sets to encode pathInfo and queryString, but the application server decodes the whole URL with a single character set.

2. We recommend that the character set used to URL-encode the URL be the same as the character set of the page's contentType. The program then stays simple and needs no complicated encoding conversion.

----- My own summary:

JSP encoding -----------------------------
1. pageEncoding: only states the encoding of the JSP file itself and has nothing to do with the encoding in which the page is displayed.
When the container reads a file, a database or a string constant, it converts the content to the Unicode it uses internally.
When the page content is output, the internal Unicode is converted to the encoding specified by contentType.
If the pageEncoding attribute is present, it determines the character encoding of the JSP page;
otherwise the charset in the contentType attribute determines it. If neither is present, the character encoding of the JSP page
defaults to ISO-8859-1.
2. contentType: specifies the MIME type and the character encoding of the JSP page's response. The default MIME type is "text/html";
the default character encoding is "ISO-8859-1". The MIME type and the character encoding are separated by a semicolon.
3. Relationship between pageEncoding and contentType:
1. The pageEncoding value is not sent as a header. It tells the web server the encoding of the JSP page;
if contentType specifies no charset, it also becomes the encoding of the response stream output by the web server.
2. Stage one: the JSP is translated into .java. The container reads the JSP according to the pageEncoding setting and translates it
from the specified encoding scheme into Java source code (.java).
3. Stage two: javac compiles the Java source code to Java bytecode. Whatever encoding the JSP was written in,
the output of stage one is UTF-8 encoded Java source code. javac reads the Java source as UTF-8
and compiles it to binary code (.class), in which constant strings are stored in UTF-8
(the JVM's encoding for string constants).
4. Stage three: Tomcat (or another application container) loads and executes the binary code produced in stage two,
and the output is sent to the client. This is where the contentType parameter, which played no part in stages one and two, takes effect.
4. Settings with the same effect as the charset of contentType: response.setCharacterEncoding(),
response.setContentType() and response.setHeader(). response.setContentType() and
response.setHeader() have the highest priority, followed by response.setCharacterEncoding(),
then <%@ page contentType="text/html; charset=GBK" %>, and last <meta http-equiv="Content-Type"
content="text/html; charset=gb2312"/>.
5. Page input encoding: setting the page encoding with <%@ page contentType="text/html; charset=GBK" %> also determines the encoding of input on the page.
If the page is displayed as UTF-8, everything the user enters on the page is encoded as UTF-8; the server-side program must
set the input encoding before it reads the form input.
When the form is submitted, the browser converts each form field value to the bytes of the specified character set and then encodes
the resulting bytes with the standard HTTP URL encoding scheme. The page still has to tell the server which encoding it uses.
request.setCharacterEncoding() changes the encoding the servlet uses to read the request; response.setCharacterEncoding()
changes the encoding of the result the servlet returns.
6. response.setCharacterEncoding() and request.setCharacterEncoding("UTF-8") have no effect on the parameters of a form submitted with GET.
GET also has a length limit that depends on the browser and server; the figure usually quoted is up to 2048 bytes (1024 Chinese characters at two bytes each).
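The division of labour between pageEncoding and contentType can be pictured as a two-step transcoding: bytes are read into Unicode with one charset and written out with another. The sketch below is a simplification, not how a JSP container is actually implemented; the class name PageTranscoding and the render method are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class PageTranscoding {
    // Step 1: decode the page source bytes with the pageEncoding charset into a Unicode String.
    // Step 2: encode that String with the response charset taken from contentType.
    static byte[] render(byte[] pageSourceBytes, String pageEncoding, String responseCharset) {
        String unicodeText = new String(pageSourceBytes, Charset.forName(pageEncoding));
        return unicodeText.getBytes(Charset.forName(responseCharset));
    }

    public static void main(String[] args) {
        // A "page" saved as GBK that contains the two Chinese characters for "China".
        byte[] gbkSource = {(byte) 0xd6, (byte) 0xd0, (byte) 0xb9, (byte) 0xfa};

        // pageEncoding=GBK, contentType charset=UTF-8: the response carries the UTF-8 bytes.
        System.out.println(Arrays.toString(render(gbkSource, "GBK", "UTF-8")));

        // A wrong pageEncoding garbles the text before contentType is even considered.
        System.out.println(Arrays.toString(render(gbkSource, "ISO-8859-1", "UTF-8")));
    }
}
```

The two settings are independent: the first must match how the file was saved, the second must match what the browser is told in the Content-Type header.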

Note that the encoding must be set before obtaining the output writer:
response.setContentType("text/html; charset=GBK");
PrintWriter out = response.getWriter();
If it is written the other way round, the encoding setting has no effect:
PrintWriter out = response.getWriter();
response.setContentType("text/html; charset=GBK");
Obtaining the writer before setting the encoding invalidates the later encoding setting.
The same applies to the request: reading a parameter before setting the encoding invalidates the setting. Wrong:
System.out.println("997:" + request.getParameter("ta"));
request.setCharacterEncoding("UTF-8");
System.out.println("999:" + request.getParameter("ta"));
Correct:
request.setCharacterEncoding("UTF-8");
System.out.println("999:" + request.getParameter("ta"));