Tomcat's default encoding settings and causes of garbled characters
http://www.linuxso.com/architecture/20099.html
Generation of garbled characters
Take the Chinese character 中 as an example. Encoded in UTF-8 it becomes three bytes, %E4%B8%AD, which are submitted to the Tomcat container through GET or POST. If you do not tell Tomcat that the parameters are encoded in UTF-8, Tomcat assumes ISO-8859-1. ISO-8859-1 (compatible with US-ASCII, the standard character set of URIs) is an ASCII-compatible single-byte encoding that uses the entire range of a single byte, so Tomcat treats the three bytes as three ISO-8859-1 characters and decodes them accordingly. The result is the three-character string "ä¸" followed by a soft hyphen (U+00AD, usually invisible).
Inside the JVM this decoded string is held as Unicode, whereas HTTP transmits bytes and a database stores bytes. The Unicode string can therefore be encoded into whatever bytes each destination needs, for example encoded as UTF-8 and stored in the database (as three UTF-8 characters). You can also recover the three original bytes by encoding the three characters as ISO-8859-1, and then decode those bytes as UTF-8 to obtain the Unicode character 中. This works because a byte stream in any other encoding can safely be treated as ISO-8859-1. Finally, the response passes the text to the client; the bytes actually transmitted again depend on the Content-Type you set.
Summary:
1. HTTP GET and POST transmit bytes, and a database also stores bytes (which is why storage space is measured in MB).
2. Garbled characters arise when the character set used for decoding differs from the one used for encoding: the same bytes map to different characters under different encodings, and some byte sequences do not exist at all in some encodings (which is where the "?" in garbled text comes from).
3. A decoded string is held as Unicode inside the JVM.
4. If the Unicode characters in the JVM are the ones you expected (the encoding and decoding character sets were the same or compatible), there is no problem. If they are not, as in the example above where the JVM holds three unexpected Unicode characters, you can get back the three bytes behind those characters (by encoding them as ISO-8859-1) and then decode the bytes as UTF-8 to produce the intended character, 中.
5. ISO-8859-1 is an ASCII-compatible single-byte encoding that uses the entire range of a single byte, so a system that supports ISO-8859-1 can transfer and store a byte stream in any other encoding without discarding anything. In other words, it is safe to treat a byte stream in any other encoding as ISO-8859-1.
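The recovery described in points 4 and 5 can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class and method names (MojibakeRoundTrip, misdecode, repair) are made up for this example and are not part of Tomcat.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRoundTrip {
    /** Decodes UTF-8 bytes with the wrong charset, as a container defaulting to ISO-8859-1 would. */
    static String misdecode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the intended text: get the original bytes back, then decode them as UTF-8. */
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] sent = "\u4e2d".getBytes(StandardCharsets.UTF_8); // the character 中 as E4 B8 AD
        String garbled = misdecode(sent);
        System.out.println(garbled.length());                 // one char per original byte
        System.out.println(repair(garbled).equals("\u4e2d")); // the original character is back
    }
}
```

The repair step only works because ISO-8859-1 maps every byte value to a character; the same trick is not reliable after a wrong decode with a multi-byte charset.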
The following code shows that encoding with different character sets gives different results, and that the output is garbled whenever the encoder and decoder do not match or the character being encoded (here a Chinese character) does not exist in the target encoding, such as ISO-8859-1.
// Uses java.net.URLEncoder, java.net.URLDecoder and java.io.UnsupportedEncodingException
try {
    // 中 encoded as UTF-8 is %E4%B8%AD (3 bytes)
    System.out.println(URLEncoder.encode("中", "UTF-8"));
    // 中 encoded as ISO-8859-1 is %3F (1 byte): 中 does not exist in the ISO-8859-1 character set, so "?" is emitted instead
    System.out.println(URLEncoder.encode("中", "ISO-8859-1"));
    // 中 encoded as GB2312 is %D6%D0 (2 bytes)
    System.out.println(URLEncoder.encode("中", "GB2312"));
    // The UTF-8 bytes %E4%B8%AD decoded as UTF-8 give the correct character 中
    System.out.println(URLDecoder.decode("%E4%B8%AD", "UTF-8"));
    // The ISO-8859-1 byte %3F decoded as ISO-8859-1 gives "?"
    System.out.println(URLDecoder.decode("%3F", "ISO-8859-1"));
    // The GB2312 bytes %D6%D0 decoded as GB2312 give the correct character 中
    System.out.println(URLDecoder.decode("%D6%D0", "GB2312"));
    // The UTF-8 bytes %E4%B8%AD decoded as ISO-8859-1 give three characters:
    // "ä", "¸" and a soft hyphen (U+00AD). This is the so-called garbled text; each of the
    // 3 bytes is simply mapped to its ISO-8859-1 character.
    // The ISO-8859-1 character set uses the entire range of a single byte.
    System.out.println(URLDecoder.decode("%E4%B8%AD", "ISO-8859-1"));
    // The UTF-8 bytes %E4%B8%AD decoded as GB2312 give 涓 followed by a substitute
    // character (typically shown as "?"): the first 2 bytes %E4%B8 are 涓 in GB2312,
    // and the 3rd byte %AD on its own is not a valid GB2312 sequence.
    System.out.println(URLDecoder.decode("%E4%B8%AD", "GB2312"));
} catch (UnsupportedEncodingException e) {
    // Thrown only if the named charset is not supported by this JVM
    e.printStackTrace();
}
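The claim that ISO-8859-1 "uses the entire range of a single byte" can be checked directly. The sketch below is illustrative only (the class and method names are invented here): it round-trips bytes through a charset and compares the result with the input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1Transparency {
    /** True if decoding the bytes with cs and re-encoding gives back exactly the same bytes. */
    static boolean roundTrips(byte[] bytes, Charset cs) {
        return Arrays.equals(bytes, new String(bytes, cs).getBytes(cs));
    }

    /** All 256 possible byte values, 0x00 through 0xFF. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        // Every byte value survives an ISO-8859-1 decode/encode cycle.
        System.out.println(roundTrips(allByteValues(), StandardCharsets.ISO_8859_1));
        // A lone 0xE4 is an incomplete UTF-8 sequence, so it is replaced and cannot be recovered.
        System.out.println(roundTrips(new byte[] {(byte) 0xE4}, StandardCharsets.UTF_8));
        // Going the other way, a character missing from ISO-8859-1 is encoded as '?' (0x3F).
        System.out.println(Arrays.toString("\u4e2d".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```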
Tomcat's default settings for encoding and related standards:
For GET requests, the "URI Syntax" specification requires that HTTP query strings (also known as GET parameters) use US-ASCII, and that every character outside that range be percent-encoded in the %XX form (for example %61). Since ISO-8859-1 and ASCII agree on the characters in the 0x20 to 0x7E range, most web containers, Tomcat included, decode the %XX bytes in the URI as ISO-8859-1 by default (this was Tomcat's default through Tomcat 7; Tomcat 8 and later default to UTF-8). You can use the URIEncoding attribute of the Connector to change the character set used to decode the %XX parts of the URI. URIEncoding must match the encoding that was used to percent-encode the query string of the GET request. Alternatively, you can set Content-Type to tell the container which encoding was used for the characters in the URL (in Tomcat this requires useBodyEncodingForURI="true" on the Connector).
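To make the effect of the decoding charset concrete, here is a minimal sketch using only the JDK. The helper decodeValue and its uriEncoding parameter are hypothetical stand-ins for what a container does with its configured URI encoding; this is not Tomcat's code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class QueryDecodingSketch {
    /** Hypothetical helper: decodes one percent-encoded value using the given charset name. */
    static String decodeValue(String rawValue, String uriEncoding) {
        try {
            return URLDecoder.decode(rawValue, uriEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + uriEncoding, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%E4%B8%AD"; // the browser sent 中 percent-encoded as UTF-8
        // Decoder and encoder disagree: three unrelated characters come out.
        System.out.println(decodeValue(raw, "ISO-8859-1").length());
        // Decoder matches the encoder: the original single character comes out.
        System.out.println(decodeValue(raw, "UTF-8").equals("\u4e2d"));
    }
}
```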
POST requests should declare the encoding they use through the charset parameter of Content-Type. Because many clients do not set an explicit encoding, Tomcat falls back to ISO-8859-1. Note the difference between the character set used for URI decoding, the request character set, and the response character set! How these three encodings relate to each other differs between request implementations.
For POST requests, ISO-8859-1 is the default encoding of the HTTP request and response as defined in the servlet specification. If the character set of the request or response is not set, the servlet specification says ISO-8859-1 is used; the encoding of the request and of the response are each specified through the Content-Type header.
If neither a GET nor a POST request specifies an encoding through Content-Type, Tomcat uses ISO-8859-1 by default. You can use SetCharacterEncodingFilter to change the default request encoding. It takes two parameters: encoding, the encoding to apply, and ignore. When ignore is true, the filter's encoding is applied whether or not the client specified one; when it is false, the encoding is applied only if the client did not specify one. The default is true.
Note: This filter should generally be placed first among all filters (before Servlet 3.0 the order follows the order of the filter-mapping elements in web.xml; from Servlet 3.0 on it can be specified through parameters), because setting the encoding has no effect once a value has been read from the request. The first time a parameter is read, Tomcat converts the variables submitted in the query string or POST body into a parameters array using the encoding in effect at that moment; after that, parameter values are taken directly from this array.
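Why ordering matters can be shown with a toy model. The class below is a hypothetical stand-in for a request, written only for illustration and assuming a single parameter; it is not Tomcat's implementation, but it reproduces the "parse once, then cache" behavior described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a request holding one raw parameter value. */
final class LazyParamRequest {
    private final byte[] rawValue;
    private Charset encoding = StandardCharsets.ISO_8859_1; // default when nothing is set
    private String parsedValue;                             // filled in on the first read

    LazyParamRequest(byte[] rawValue) {
        this.rawValue = rawValue.clone();
    }

    void setCharacterEncoding(Charset cs) {
        this.encoding = cs;
    }

    String getParameter() {
        if (parsedValue == null) {
            parsedValue = new String(rawValue, encoding); // decoded once, then cached
        }
        return parsedValue;
    }

    public static void main(String[] args) {
        byte[] body = "\u4e2d".getBytes(StandardCharsets.UTF_8);

        LazyParamRequest early = new LazyParamRequest(body);
        early.setCharacterEncoding(StandardCharsets.UTF_8);   // set before the first read
        System.out.println(early.getParameter().equals("\u4e2d"));

        LazyParamRequest late = new LazyParamRequest(body);
        late.getParameter();                                  // something read a parameter first
        late.setCharacterEncoding(StandardCharsets.UTF_8);    // too late, the value is cached
        System.out.println(late.getParameter().equals("\u4e2d"));
    }
}
```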
UTF-8 is recommended everywhere:
1. Set URIEncoding="UTF-8" on your <Connector> in server.xml. This makes Tomcat decode the URIs of HTTP GET requests as UTF-8.
2. Use a character encoding filter with the default encoding set to UTF-8. Because many requests do not specify an encoding themselves, Tomcat defaults to ISO-8859-1 for the HttpServletRequest; the filter changes this.
3. Change all your JSPs to include the charset name in their contentType. For example, use <%@page contentType="text/html; charset=UTF-8" %> for the usual JSP pages and <jsp:directive.page contentType="text/html; charset=UTF-8" /> for the pages in XML syntax (aka JSP Documents). This specifies the encoding used by the JSP page.
4. Change all your servlets to set the content type for responses and to include the charset name in the content type to be UTF-8. Use response.setContentType("text/html; charset=UTF-8") or response.setCharacterEncoding("UTF-8"). This sets the encoding of the response.
5. Change any content-generation libraries you use (Velocity, Freemarker, etc.) to use UTF-8 and to specify UTF-8 in the content type of the responses that they generate. In other words, specify the encoding for every template engine.
6. Disable any valves or filters that may read request parameters before your character encoding filter or JSP page has a chance to set the encoding to UTF-8. SetCharacterEncodingFilter is generally placed first; otherwise it may have no effect.
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package filters;

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
/**
 * <p>Example filter that sets the character encoding to be used in parsing the
 * incoming request, either unconditionally or only if the client did not
 * specify a character encoding.  Configuration of this filter is based on
 * the following initialization parameters:</p>
 * <ul>
 * <li><strong>encoding</strong> - The character encoding to be configured
 *     for this request, either conditionally or unconditionally based on
 *     the <code>ignore</code> initialization parameter.  This parameter
 *     is required, so there is no default.</li>
 * <li><strong>ignore</strong> - If set to "true", any character encoding
 *     specified by the client is ignored, and the value returned by the
 *     <code>selectEncoding()</code> method is set.  If set to "false",
 *     <code>selectEncoding()</code> is called <strong>only</strong> if the
 *     client has not already specified an encoding.  By default, this
 *     parameter is set to "true".</li>
 * </ul>
 *
 * <p>Although this filter can be used unchanged, it is also easy to
 * subclass it and make the <code>selectEncoding()</code> method more
 * intelligent about what encoding to choose, based on characteristics of
 * the incoming request (such as the values of the <code>Accept-Language</code>
 * and <code>User-Agent</code> headers, or a value stashed in the current
 * user's session).</p>
 *
 * @author Craig McClanahan
 * @version $Id: SetCharacterEncodingFilter.java 939521 2010-04-30 00:16:33Z kkolinko $
 */
public class SetCharacterEncodingFilter implements Filter {

    // ----------------------------------------------------- Instance Variables

    /**
     * The default character encoding to set for requests that pass through
     * this filter.
     */
    protected String encoding = null;

    /**
     * The filter configuration object we are associated with.  If this value
     * is null, this filter instance is not currently configured.
     */
    protected FilterConfig filterConfig = null;

    /**
     * Should a character encoding specified by the client be ignored?
     */
    protected boolean ignore = true;

    // --------------------------------------------------------- Public Methods

    /**
     * Take this filter out of service.
     */
    public void destroy() {
        this.encoding = null;
        this.filterConfig = null;
    }

    /**
     * Select and set (if specified) the character encoding to be used to
     * interpret request parameters for this request.
     *
     * @param request The servlet request we are processing
     * @param response The servlet response we are creating
     * @param chain The filter chain we are processing
     *
     * @exception IOException if an input/output error occurs
     * @exception ServletException if a servlet error occurs
     */
    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain)
            throws IOException, ServletException {
        // Conditionally select and set the character encoding to be used
        if (ignore || (request.getCharacterEncoding() == null)) {
            String encoding = selectEncoding(request);
            if (encoding != null)
                request.setCharacterEncoding(encoding);
        }
        // Pass control on to the next filter
        chain.doFilter(request, response);
    }

    /**
     * Place this filter into service.
     *
     * @param filterConfig The filter configuration object
     */
    public void init(FilterConfig filterConfig) throws ServletException {
        this.filterConfig = filterConfig;
        this.encoding = filterConfig.getInitParameter("encoding");
        String value = filterConfig.getInitParameter("ignore");
        if (value == null)
            this.ignore = true;
        else if (value.equalsIgnoreCase("true"))
            this.ignore = true;
        else if (value.equalsIgnoreCase("yes"))
            this.ignore = true;
        else
            this.ignore = false;
    }

    // ------------------------------------------------------ Protected Methods

    /**
     * Select an appropriate character encoding to be used, based on the
     * characteristics of the current request and/or filter initialization
     * parameters.  If no character encoding should be set, return
     * <code>null</code>.
     * <p>
     * The default implementation unconditionally returns the value configured
     * by the <strong>encoding</strong> initialization parameter for this
     * filter.
     *
     * @param request The servlet request we are processing
     */
    protected String selectEncoding(ServletRequest request) {
        return (this.encoding);
    }
}
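The interplay of the ignore flag, the client's encoding and the configured encoding is easy to misread, so the sketch below restates the logic of init() and doFilter() as two pure functions that can be run without a servlet container. The class and method names are invented for this illustration; only the rules themselves come from the filter above.

```java
final class EncodingDecisionSketch {
    /** Mirrors init(): a missing value, "true" or "yes" (any case) means ignore = true. */
    static boolean parseIgnore(String value) {
        return value == null
                || value.equalsIgnoreCase("true")
                || value.equalsIgnoreCase("yes");
    }

    /** Mirrors doFilter(): returns the encoding the request ends up with (may be null). */
    static String effectiveEncoding(boolean ignore, String clientEncoding, String configured) {
        if ((ignore || clientEncoding == null) && configured != null) {
            return configured; // the filter would call request.setCharacterEncoding(configured)
        }
        return clientEncoding; // the filter leaves the request untouched
    }

    public static void main(String[] args) {
        System.out.println(effectiveEncoding(true, "GBK", "UTF-8"));  // filter overrides the client
        System.out.println(effectiveEncoding(false, "GBK", "UTF-8")); // client's choice is kept
        System.out.println(effectiveEncoding(false, null, "UTF-8"));  // filter fills the gap
    }
}
```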
References: Tomcat Wiki FAQ, Character Encoding Issues
Apache Tomcat Configuration Reference, The HTTP Connector
Tomcat Chinese processing (2)
http://zhangguyou2009.blog.163.com/blog/static/34691638200711810485391/
10:48:05 | category: Tomcat
The previous article described how Tomcat decodes the characters it receives. Now let's look at what happens when an HTML document is written to the client.
Data is written to the client through the response's output stream.
But how does a JSP use the response stream?
In a JSP, the implicit out object is an instance of a JspWriter implementation class. JspWriterImpl(ServletResponse response, int sz, boolean autoFlush) is a constructor of this class. It takes the response, and JspWriterImpl holds a java.io.Writer internally. When data is written through JspWriter (the JSP out object), the following function is called to initialize that Writer instance.
protected void initOut() throws IOException
{
    if (out == null)
    {
        out = response.getWriter(); // initialize the internal java.io.Writer object
    }
}
Each of JspWriter's output methods then delegates to the corresponding method of this java.io.Writer object.
Therefore, when a JSP or servlet writes HTML to the client, it either obtains the character stream through response.getWriter() or obtains the binary stream through getOutputStream().
A response has a character stream and a binary stream, but only one of them can be open at a time. The relationship between the two is described later. The JSP out object is the response's character stream.
Likewise, a request has a character stream and a binary stream, but only one of them can be open at a time.
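The "only one stream at a time" rule can be modeled in a few lines. The class below is a hypothetical stand-in written for illustration, not a container class; it assumes the servlet API's behavior of throwing IllegalStateException when the second kind of stream is requested.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

/** Hypothetical model of a response that allows either a writer or a byte stream, not both. */
final class ExclusiveStreamsSketch {
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
    private boolean streamUsed;
    private boolean writerUsed;

    OutputStream getOutputStream() {
        if (writerUsed) {
            throw new IllegalStateException("getWriter() has already been called");
        }
        streamUsed = true;
        return sink;
    }

    PrintWriter getWriter() {
        if (streamUsed) {
            throw new IllegalStateException("getOutputStream() has already been called");
        }
        writerUsed = true;
        // The character stream is a wrapper around the same underlying byte sink.
        return new PrintWriter(new OutputStreamWriter(sink, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        ExclusiveStreamsSketch response = new ExclusiveStreamsSketch();
        response.getWriter();
        try {
            response.getOutputStream();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```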
Relationship between the response's two streams
The following shows how the response implementation class implements getOutputStream() and getWriter():
public ServletOutputStream getOutputStream() throws IOException
{
    // ...
    stream = createOutputStream(); // create the response's binary output stream
    // ...
    return stream;
}
public PrintWriter getWriter() throws IOException
{
    // ...
    ResponseStream newStream = (ResponseStream) createOutputStream(); // create the binary stream
    // ...
    OutputStreamWriter osr = new OutputStreamWriter(newStream, getCharacterEncoding());
    writer = new ResponseWriter(osr, newStream); // the response's character output stream
    // ...
}
Clearly, the character stream is built on top of the binary stream.
Note the following two functions:
public String getCharacterEncoding() // the response encoding; the default is ISO-8859-1
{
    if (encoding == null) // no encoding has been specified
    {
        return "ISO-8859-1";
    } else
    {
        return encoding;
    }
}
public void setContentType(String type) // set the response type and encoding
{
    // ... (code elided in the original, including the opening of the "if" that the "else" below belongs to)
        encoding = RequestUtil.parseCharacterEncoding(type); // get the encoding named in the content type
        if (encoding == null)
        {
            encoding = "ISO-8859-1"; // no encoding was specified
        }
    } else
    if (encoding != null)
    {
        contentType = type + ";charset=" + encoding;
    }
}
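The idea behind extracting the encoding from a content type can be sketched without any container classes. The parser below is a simplified, hypothetical helper written for this illustration; it is not the implementation of RequestUtil.parseCharacterEncoding and handles only the common "type; charset=name" shape.

```java
import java.util.Locale;

final class ContentTypeCharsetSketch {
    /** Hypothetical parser: returns the charset parameter of a Content-Type value, or null. */
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding quotes, e.g. charset="UTF-8"
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Applies the fallback described in the text: ISO-8859-1 when no charset is given. */
    static String effectiveCharset(String contentType) {
        String parsed = parseCharset(contentType);
        return parsed != null ? parsed : "ISO-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(effectiveCharset("text/html; charset=UTF-8"));
        System.out.println(effectiveCharset("text/html"));
    }
}
```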
Now we know how the response's character stream is built when characters are written, whether from a JSP or a servlet: OutputStreamWriter osr = new OutputStreamWriter(newStream, getCharacterEncoding());
Note that newStream is the implementation of the response's binary stream.
So we have to look at the implementation of OutputStreamWriter:
Looking at the source code of OutputStreamWriter, it holds an object of type StreamEncoder and relies on it to convert the encoding.
StreamEncoder is provided by Sun. It has the method
public static StreamEncoder forOutputStreamWriter(OutputStream outputstream, Object obj, String s) for obtaining a StreamEncoder instance.
When this is called for a JSP or servlet, the outputstream parameter is the response's binary stream, obj is the OutputStreamWriter object, and s is the name of the encoding. What we actually get back is an instance of a subclass of StreamEncoder:
return new CharsetSE(outputstream, obj, Charset.forName(s1)); CharsetSE is a subclass of StreamEncoder.
It has the following function to perform the encoding conversion:
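The effect of wrapping a binary stream in an OutputStreamWriter can be observed with the JDK alone. In this sketch a ByteArrayOutputStream stands in for the response's binary stream, and the class and helper names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class WriterOverStreamSketch {
    /** Writes text through an OutputStreamWriter and returns the bytes that reached the stream. */
    static byte[] bytesWritten(String text, String charsetName) {
        ByteArrayOutputStream binary = new ByteArrayOutputStream(); // stand-in for the response stream
        try (Writer writer = new OutputStreamWriter(binary, charsetName)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return binary.toByteArray();
    }

    /** Formats bytes as upper-case hex, two digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(bytesWritten("\u4e2d", "UTF-8")));      // E4B8AD
        System.out.println(hex(bytesWritten("\u4e2d", "ISO-8859-1"))); // 3F
    }
}
```

The same character produces different bytes, or a "?" substitute, depending solely on the encoding name passed to the writer, which is why the value returned by getCharacterEncoding() matters.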
void implWrite(char ac[], int i, int j) throws IOException // ac is the char array holding the characters of the string
{
    CharBuffer charbuffer = CharBuffer.wrap(ac, i, j);
    // ...
    CoderResult coderresult = encoder.encode(charbuffer, bb, false); // bb is a ByteBuffer that receives the encoded bytes
    // ...
    writeBytes(); // write the bytes in bb to the response's binary stream
    // ...
}
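The encode-then-drain pattern in implWrite can be reproduced with the public java.nio.charset API. The sketch below is an illustration, not the StreamEncoder source: the class name, the deliberately tiny buffer and the in-memory sink are assumptions, and since all input is available at once it passes true for endOfInput where the excerpt above passes false.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderLoopSketch {
    /** Encodes chars[offset, offset + length) and returns the bytes that would reach the binary stream. */
    static byte[] encode(char[] chars, int offset, int length, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(chars, offset, length);
        ByteBuffer bb = ByteBuffer.allocate(8); // small on purpose, to force several drains
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        CoderResult result;
        do { // encode until all input has been consumed
            result = encoder.encode(in, bb, true);
            drain(bb, sink);
        } while (result.isOverflow());
        do { // let the encoder emit any trailing bytes
            result = encoder.flush(bb);
            drain(bb, sink);
        } while (result.isOverflow());
        return sink.toByteArray();
    }

    /** Copies the bytes accumulated in bb to the sink and makes bb writable again. */
    private static void drain(ByteBuffer bb, ByteArrayOutputStream sink) {
        bb.flip();
        sink.write(bb.array(), bb.position(), bb.remaining());
        bb.clear();
    }

    public static void main(String[] args) {
        char[] text = "\u4e2d\u4e2d\u4e2d\u4e2d".toCharArray();
        System.out.println(encode(text, 0, text.length, StandardCharsets.UTF_8).length);      // 12
        System.out.println(encode(text, 0, text.length, StandardCharsets.ISO_8859_1).length); // 4
    }
}
```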
At this point we have seen the encoding conversion that takes place behind the scenes in Tomcat.