Character encoding problems in Java EE


0 CharacterEncodingFilter registered in web.xml

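<!-- Configure the character set filter -->
<filter>
    <filter-name>encodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
</filter>
<!-- A filter only runs if it is mapped; the /* pattern here is an assumption -->
<filter-mapping>
    <filter-name>encodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>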

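The above configuration is roughly equivalent to calling request.setCharacterEncoding("UTF-8") in the servlet. By default, the filter applies the encoding only when the request does not already declare one.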

1 POST Request

For a POST request sent with jQuery.ajax, Chinese text is received correctly even when request.setCharacterEncoding("UTF-8") is not called. Calling request.getCharacterEncoding() directly on such a request already returns UTF-8, and the request headers show a Content-Type containing charset=UTF-8. The body is probably decoded correctly because the request header specifies the encoding.

For a POST request submitted from an HTML form, request.getCharacterEncoding() returns null, so request.setCharacterEncoding is required.

So it is best to always call request.setCharacterEncoding("UTF-8") so that both kinds of request are handled.
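The difference between the two kinds of POST request can be modelled in plain Java, without a servlet container. The following is only a sketch under assumptions: RequestCharsetSketch and resolve are hypothetical names, and the parsing is far simpler than what a real container does. It reads the charset parameter from a Content-Type header value and falls back to a supplied default when none is present, which is the gap that request.setCharacterEncoding("UTF-8") fills.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RequestCharsetSketch {

    // Hypothetical helper: returns the charset named in a Content-Type
    // header value, or the fallback when the header is missing or carries
    // no usable charset parameter.
    static Charset resolve(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" parameter
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Header shaped like the jQuery.ajax POST described above
        System.out.println(resolve(
                "application/x-www-form-urlencoded; charset=UTF-8",
                StandardCharsets.ISO_8859_1));
        // Header shaped like a plain HTML form POST: no charset parameter
        System.out.println(resolve(
                "application/x-www-form-urlencoded",
                StandardCharsets.ISO_8859_1));
    }
}
```

In this model, passing UTF-8 as the fallback plays the role of calling request.setCharacterEncoding("UTF-8") before reading any parameter.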

2 GET Request

However, the setting above only applies to POST requests; Tomcat handles GET and POST requests differently. In Tomcat 5.0, data submitted in the URL and form data submitted with the GET method are decoded with ISO-8859-1 by default, rather than with the encoding returned by request.getCharacterEncoding().

To work around this problem (you can also convert the value on the Java side with getBytes), set the useBodyEncodingForURI or URIEncoding attribute on the Connector element in Tomcat's server.xml configuration file. The useBodyEncodingForURI attribute indicates whether data submitted in the URL and form data submitted with GET are decoded using the encoding set by request.setCharacterEncoding; it defaults to false (in Tomcat 4.0 this behavior was the default). The URIEncoding attribute specifies a single encoding used to decode all GET requests, including data submitted in the URL and form data submitted with GET.

<Connector connectionTimeout="20000" port="8080" protocol="HTTP/1.1" redirectPort="8443" useBodyEncodingForURI="true" />

The difference between URIEncoding and useBodyEncodingForURI is that URIEncoding applies one encoding to all data sent with the GET method, while useBodyEncodingForURI decodes the data with the encoding set by request.setCharacterEncoding for the page being requested, so different pages can use different encodings. So for data submitted in the URL and form data submitted with GET, you can either set URIEncoding to the browser's encoding, or set useBodyEncodingForURI to true (at that point the request encoding is still null) and call request.setCharacterEncoding with the browser's encoding in the JSP page that receives the data.
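The getBytes conversion mentioned above can be sketched in plain Java. This is an illustration rather than Tomcat code: GetParamRecode and its methods are hypothetical names, and misdecode only simulates a container that decodes UTF-8 bytes as ISO-8859-1. The round trip works because ISO-8859-1 maps every byte to exactly one character, so the original bytes can be recovered without loss.

```java
import java.nio.charset.StandardCharsets;

class GetParamRecode {

    // Simulates the container: the browser sent UTF-8 bytes, but they
    // were decoded as ISO-8859-1, producing garbled text.
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // The workaround: turn the garbled string back into the bytes that
    // were on the wire, then decode those bytes as UTF-8.
    static String recode(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the example does
        // not depend on the encoding of this source file.
        String original = "\u6c49\u5b57";
        String garbled = misdecode(original);
        System.out.println(garbled.equals(original));         // false
        System.out.println(recode(garbled).equals(original)); // true
    }
}
```

Setting URIEncoding or useBodyEncodingForURI on the Connector avoids having to repeat this conversion for every parameter.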

3 Encoding Sequence

When sending data, the server encodes the response using the following order of precedence: response.setCharacterEncoding, then contentType, then pageEncoding.

1. pageEncoding="UTF-8" sets the encoding used when the JSP is compiled into a servlet.

2. contentType="text/html;charset=UTF-8" specifies the encoding used to encode the server response.

3. request.setCharacterEncoding("UTF-8") sets the encoding used to decode the client request.

4. response.setCharacterEncoding("UTF-8") specifies the encoding used to encode the server response. The browser also decodes the data it receives according to this parameter (the charset carried in the HTTP response header). The meta tag in HTML also has a charset, which takes effect when a page is saved and opened locally, because there is no HTTP header in that case.

Note how the encoding of the JSP file itself is determined. In standard JSP syntax, if the pageEncoding attribute is present, it determines the character encoding of the JSP page; otherwise the charset in the contentType attribute determines it; if that charset is also absent, the JSP page encoding defaults to ISO-8859-1 (the default encoding of the file can be configured in Eclipse itself).
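The fallback chain for the JSP page encoding can be written down as a small function. This is a sketch of the rule as stated above, not code from any JSP container: JspEncodingSketch and pageEncoding are hypothetical names, and the two arguments stand for the values of the pageEncoding and contentType attributes of the page directive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JspEncodingSketch {

    // Matches a charset parameter such as "charset=UTF-8" inside a
    // contentType value, ignoring case and optional quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: pageEncoding wins; otherwise the charset in
    // contentType; otherwise the ISO-8859-1 default.
    static String pageEncoding(String pageEncodingAttr, String contentTypeAttr) {
        if (pageEncodingAttr != null && !pageEncodingAttr.trim().isEmpty()) {
            return pageEncodingAttr.trim();
        }
        if (contentTypeAttr != null) {
            Matcher m = CHARSET.matcher(contentTypeAttr);
            if (m.find()) {
                return m.group(1);
            }
        }
        return "ISO-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(pageEncoding("GBK", "text/html;charset=UTF-8"));
        System.out.println(pageEncoding(null, "text/html;charset=UTF-8"));
        System.out.println(pageEncoding(null, "text/html"));
    }
}
```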

4 Other

When Java source is compiled, the compiler reads the source file using either the system default encoding (our usual environments are Eclipse or the operating system; Eclipse's default encoding can be changed, and a Chinese-language operating system uses GBK by default) or the character set specified with javac -encoding XXX, and converts it to Unicode in memory for compilation (Java uses Unicode in memory, where each char is a two-byte UTF-16 code unit). After compilation, the character data is stored in the bytecode (class) file in a Unicode form (modified UTF-8 in the constant pool).

At run time, Java also represents characters internally as Unicode, while default input and output use the default encoding of the operating system. Call System.getProperty("file.encoding") to view the default encoding.
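A quick way to check both points on a given machine is sketched below. DefaultEncodingCheck is a hypothetical class name; the printed default encoding depends on the JVM version, the operating system, and any -Dfile.encoding setting, so no output is shown for it.

```java
import java.nio.charset.Charset;

class DefaultEncodingCheck {

    // Number of 16-bit char units the string occupies in memory.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters as a reader counts them).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Platform-dependent values
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        String han = "\u6c49";          // U+6C49, inside the Basic Multilingual Plane
        String emoji = "\uD83D\uDE00";  // U+1F600, needs a surrogate pair

        System.out.println(utf16Units(han) + " unit(s), " + codePoints(han) + " code point(s)");
        System.out.println(utf16Units(emoji) + " unit(s), " + codePoints(emoji) + " code point(s)");
    }
}
```

The second line of counts shows why "two bytes per character" holds only for characters in the Basic Multilingual Plane: a character outside it takes two char units.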

On the front end, parameters can be encoded in JavaScript with encodeURIComponent(); on the Java side, java.net.URLDecoder.decode can be used to decode them. One thing to note is that Tomcat automatically URL-decodes the request first, as can be seen in Tomcat's UDecoder class. Tomcat does not use URLDecoder.decode but implements its own decode function. Some articles online describe a way to deal with garbled characters: call encodeURIComponent twice on the parameter in JavaScript, then decode once in Java. This solves some of the garbled-text problems that occur when URIEncoding is not set. It works because Tomcat's automatic decode of a doubly encoded value yields a string that is still percent-encoded and purely ASCII, so the charset used for that first decode does not matter; the application then performs the second decode itself with UTF-8.
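The double-encoding trick can be traced step by step in plain Java. This is a sketch under assumptions: DoubleEncodeSketch and its methods are hypothetical names, URLEncoder stands in for encodeURIComponent (the two differ for some characters, such as the space), and URLDecoder with ISO-8859-1 stands in for the container's automatic decode.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class DoubleEncodeSketch {

    // Thin wrappers that turn the checked exception into an unchecked one.
    static String enc(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static String dec(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Client side: encode twice, as with two encodeURIComponent calls.
    static String encodeTwice(String value) {
        return enc(enc(value, "UTF-8"), "UTF-8");
    }

    // Stand-in for the container's automatic decode with ISO-8859-1.
    static String containerDecode(String raw) {
        return dec(raw, "ISO-8859-1");
    }

    public static void main(String[] args) {
        String original = "\u6c49";

        // Encoded once: the container's ISO-8859-1 decode garbles it.
        String once = enc(original, "UTF-8");
        System.out.println(containerDecode(once).equals(original)); // false

        // Encoded twice: the container's decode leaves ASCII text,
        // and the application's own UTF-8 decode restores the value.
        String twice = encodeTwice(original);
        String afterContainer = containerDecode(twice);
        System.out.println(afterContainer);                               // %E6%B1%89
        System.out.println(dec(afterContainer, "UTF-8").equals(original)); // true
    }
}
```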

5 Unicode and UTF-8

(1) ASCII: stored in 8 bits (one byte). Values 0 to 127 cover English letters, punctuation, and so on (standard ASCII itself needs only 7 bits); values 128 to 255 are used by extended ASCII.

(2) GBK: a Chinese character takes two bytes (16 bits) and an English letter takes one byte.

(3) Unicode ("uniform code"): an encoding scheme that assigns a unique code to every character in the world. A commonly used form is UCS-2, which encodes a character in two bytes and therefore covers only code points up to 0xFFFF; UTF-16 extends it with surrogate pairs for the rest. Unicode also has the UCS-4 specification, which encodes a character in 4 bytes.

(4) UTF-8: one way of implementing Unicode. It is a variable-length encoding that uses 1 to 4 bytes per symbol, depending on the symbol. Note that a Chinese character takes 2 bytes in Unicode (UCS-2) but 3 bytes in UTF-8. Going from Unicode to UTF-8 is not a direct mapping; the conversion follows a set of rules, summarized in the following table:

    Code point range (hex)    UTF-8 byte sequence (binary)
    0000 0000 - 0000 007F     0xxxxxxx
    0000 0080 - 0000 07FF     110xxxxx 10xxxxxx
    0000 0800 - 0000 FFFF     1110xxxx 10xxxxxx 10xxxxxx
    0001 0000 - 0010 FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
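The byte counts listed above can be checked directly with String.getBytes. The sketch below uses a hypothetical class name, ByteLengthDemo. UTF-16BE is used in place of the plain UTF-16 charset because the latter prepends a two-byte byte order mark when encoding. GBK is only tried if the running JVM supports it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengthDemo {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Big-endian UTF-16 without a byte order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String letter = "A";
        String han = "\u6c49";          // a Chinese character in the BMP
        String emoji = "\uD83D\uDE00";  // U+1F600, outside the BMP

        System.out.println(utf8Length(letter) + " / " + utf16Length(letter)); // 1 / 2
        System.out.println(utf8Length(han) + " / " + utf16Length(han));       // 3 / 2
        System.out.println(utf8Length(emoji) + " / " + utf16Length(emoji));   // 4 / 4

        // GBK is not guaranteed to be present in every Java runtime.
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            System.out.println("GBK: " + han.getBytes(gbk).length + " byte(s)");
        }
    }
}
```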

Example 1: The Unicode code point of the character "汉" (Han) is 0x6C49. 0x6C49 lies between 0x0800 and 0xFFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 0x6C49 is 0110 1100 0100 1001. Filling the x positions of the template with this bit stream, in order, gives 11100110 10110001 10001001, that is, E6 B1 89.
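The same calculation can be done with a few bit operations and compared against the JDK's own encoder. This is a minimal sketch that handles only the 3-byte range from Example 1; Utf8ThreeByte and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ThreeByte {

    // Encodes a code point in 0x0800..0xFFFF (surrogates excluded)
    // with the template 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] encode(int codePoint) {
        boolean inRange = codePoint >= 0x0800 && codePoint <= 0xFFFF;
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (!inRange || surrogate) {
            throw new IllegalArgumentException("not a 3-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // low 6 bits
        };
    }

    // Formats bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x6C49);
        byte[] jdk = "\u6c49".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                // E6 B1 89
        System.out.println(Arrays.equals(manual, jdk)); // true
    }
}
```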
