Charset Encoding Basics
Charset is short for character encoding, or character set encoding. A charset is an algorithm that converts characters into bytes, or bytes back into characters. Java uses Unicode internally to represent characters. Converting Unicode characters into bytes is called encoding, and restoring bytes to Unicode characters is called decoding.
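As a minimal sketch of these two directions in Java, `String.getBytes(Charset)` encodes and `new String(byte[], Charset)` decodes. The class and method names below are illustrative only, and the example assumes the runtime provides the GBK charset, which standard JDK distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {
    // Encoding: Unicode characters -> bytes in the given charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decoding: bytes -> Unicode characters, reading the bytes in the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the source stays ASCII.
        String text = "\u4E2D\u6587";
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        byte[] gbk = encode(text, Charset.forName("GBK")); // assumes GBK is available

        System.out.println(utf8.length); // 6 (3 bytes per character)
        System.out.println(gbk.length);  // 4 (2 bytes per character)
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The same text yields different byte sequences under different charsets, which is why the decoder has to know which charset the encoder used.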
The request parameters that a browser sends to a Web application are represented as a byte stream. They must be decoded before the Java program can read them. The charset used to decode request parameters is called the input charset.
The response content that the Web application returns to the browser must be encoded into a byte stream before the browser or client can interpret it. The charset used to encode the response content is called the output charset.
In general, the input charset and output charset are the same, because the browser always encodes form data with the charset of the current page. For example, if a form page declares contentType="text/html; charset=GBK", then when a user fills in the form and submits it, the browser encodes the data the user entered in GBK. If the input charset and output charset differ, the server cannot correctly decode the form data that the browser sends back to the Web application, because the browser encoded that data according to the output charset.
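The effect of such a mismatch can be simulated without a browser. The sketch below is hypothetical (`FormCharsetMismatch` and `submit` are names invented for illustration): it encodes the form text with the page charset and decodes the bytes with the server's input charset. It assumes the GBK charset is available in the runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FormCharsetMismatch {
    // Simulates a browser encoding form text with the page charset and a
    // server decoding the resulting bytes with its input charset.
    static String submit(String formText, Charset pageCharset, Charset inputCharset) {
        byte[] wireBytes = formText.getBytes(pageCharset); // what the browser sends
        return new String(wireBytes, inputCharset);        // what the server reads
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumes GBK is available
        String typed = "\u4E2D\u6587";        // two Chinese characters typed into the form

        // Same charset on both sides: the text survives.
        System.out.println(submit(typed, gbk, gbk).equals(typed)); // true
        // Page encoded in GBK but decoded as UTF-8: the text is garbled.
        System.out.println(submit(typed, gbk, StandardCharsets.UTF_8).equals(typed)); // false
    }
}
```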
There are some exceptions, however, where the input and output charset may be different:
Forms submitted through JavaScript are always encoded in UTF-8. This means UTF-8 must be used as the input charset to decode the parameters correctly. So unless the output charset is also UTF-8, the two will differ.
When applications call each other over HTTP, they may use different encodings. For example, application A accesses application B using UTF-8, while application B uses GBK as its input/output charset. This produces a parameter decoding error.
When a URL containing parameters is typed directly into the browser's address bar, the result depends on the browser and operating system settings:
For example, on Chinese Windows, testing shows that both IE and Firefox encode parameters in GBK by default. IE does not even URL-encode parameters that are typed in directly.
On Mac systems, experiments show that both Safari and Firefox encode parameters in UTF-8 by default.
The Relationship Between Locale and Charset
Locale and charset are two relatively independent parameters, but they are related.
The locale determines the language of the text to display, while the charset encodes text in that language into bytes or decodes bytes back into text. The charset must therefore cover the language that the locale represents; otherwise the text may appear garbled. The following table lists some combinations of locale and charset:
The relationship between locale and charset
Charsets are listed from the English character set (ISO-8859-1) through the Chinese character sets to the full character set (UTF-8).
Locale | ISO-8859-1 | GB2312 | Big5 | GBK | GB18030 | UTF-8 |
en_US (American English) | √ | √ | √ | √ | √ | √ |
zh_CN (Simplified Chinese) | | √ | | √ | √ | √ |
zh_TW, zh_HK (Taiwanese Chinese, Hong Kong Chinese) | | | √ | √ | √ | √ |
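Whether a charset covers a given piece of text can be checked in Java with `CharsetEncoder.canEncode`. The following is a small sketch under the assumption that the runtime ships the named charsets (GB2312, Big5, and GBK are part of the extended charsets in standard JDK builds); `CharsetCoverage` and `covers` are illustrative names.

```java
import java.nio.charset.Charset;

final class CharsetCoverage {
    // Returns true when every character of the text can be encoded in the named charset.
    static boolean covers(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String english = "hello";
        String simplified = "\u4E2D\u6587"; // two characters common in Simplified Chinese text
        String traditional = "\u9AD4";      // a Traditional Chinese character

        String[] charsets = {"ISO-8859-1", "GB2312", "Big5", "GBK", "GB18030", "UTF-8"};
        for (String name : charsets) {
            System.out.println(name
                    + " english=" + covers(name, english)
                    + " simplified=" + covers(name, simplified)
                    + " traditional=" + covers(name, traditional));
        }
    }
}
```

A check like this tests individual strings only, so it complements the table rather than replacing it.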
Among all charsets, a few are "all-purpose" encodings:
UTF-8
Covers every character in Unicode. However, when a page consisting mostly of Chinese is encoded in UTF-8, each Chinese character takes 3 bytes. UTF-8 is recommended for pages that are not predominantly Chinese.
GB18030
A Chinese national standard that, like UTF-8, covers every character in Unicode. GB18030 has an advantage for predominantly Chinese pages, because most common Chinese characters take only 2 bytes, one third less than in UTF-8. However, GB18030 may not be recognized on non-Chinese operating systems, so it is less universal than UTF-8. GB18030 is therefore recommended only for predominantly Chinese pages.
GBK
Strictly speaking, GBK is neither an all-purpose encoding (for example, it does not support many Western European characters) nor an international standard. However, the number of characters it supports is close to that of GB18030.
Setting Up Locale and Charset
In the Servlet API, the following APIs are related to locale and CharSet.
Locale- and charset-related Servlet APIs
Method | Purpose | Notes |
HttpServletRequest |
.getCharacterEncoding() | Read the input encoding | |
.setCharacterEncoding(charset) | Set the input encoding | Must be called before the first call to request.getParameter() or request.getParameterMap(); otherwise it has no effect. If not set, parameters are decoded with ISO-8859-1 by default. Generally affects only the decoding of POST request parameters. |
.getLocale() | Get the browser's preferred locale from Accept-Language | |
.getLocales() | Get all locales specified in Accept-Language | |
HttpServletResponse |
.getCharacterEncoding() | Get the output encoding | |
.setCharacterEncoding(charset) | Set the output encoding | Since Servlet 2.4 |
.getContentType() | Get the content type | Since Servlet 2.4 |
.setContentType(contentType) | Set the content type | The content type may include a charset definition, for example: text/html; charset=GBK |
.getLocale() | Get the output locale | |
.setLocale(locale) | Set the output locale | Must be called before the response is committed; otherwise it has no effect. It also sets the charset, unless the content type has already been set and includes a charset definition. |
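To make the note about content types concrete, the sketch below extracts the charset definition from a content type string such as text/html; charset=GBK. It is not how any particular servlet container implements this; `ContentTypeCharset` and `charsetOf` are hypothetical names, and the parsing is deliberately simplified.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class ContentTypeCharset {
    private static final String KEY = "charset=";

    // Returns the charset declared in a content type value,
    // or null when the value carries no charset definition.
    static Charset charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith(KEY)) {
                String name = p.substring(KEY.length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                return Charset.forName(name);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=GBK")); // GBK
        System.out.println(charsetOf("text/html"));              // null
    }
}
```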
Setting the locale and charset looks easy but is harder to get right than it seems:
The input encoding must be set before the first call that reads a request parameter; otherwise it has no effect.
In Servlet 2.3 and earlier, the only way to set the output encoding was to set a content type that includes a charset definition. Servlet 2.4 improved this by adding a separate method for setting the output encoding.
Parsing the Parameters of a GET Request
A GET request is the simplest kind of request. Its parameters are carried in the URL in URL-encoded form. When you type "http://localhost:8081/user/login.htm?name=%E5%90%8D%E5%AD%97&password=password" into the browser's address bar, the browser sends the following HTTP request to the server at localhost:8081:
GET /user/login.htm?name=%E5%90%8D%E5%AD%97&password=password HTTP/1.1
Host: localhost:8081
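The percent-encoded value of the name parameter in this request can be reproduced with `java.net.URLEncoder`, which URL-encodes a string using a given charset. This is a minimal sketch; `QueryValueCodec` is an illustrative name, and the GBK line assumes that charset is available.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class QueryValueCodec {
    // URL-encodes a parameter value using the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    // URL-decodes a parameter value using the named charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String name = "\u540D\u5B57"; // the two Chinese characters behind the name parameter
        System.out.println(encode(name, "UTF-8")); // %E5%90%8D%E5%AD%97
        // The same text produces a different byte sequence in GBK.
        System.out.println(encode(name, "GBK"));
        System.out.println(decode("%E5%90%8D%E5%AD%97", "UTF-8").equals(name)); // true
    }
}
```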
The parameters of a GET request are encoded as application/x-www-form-urlencoded using a specific charset. If the charset used to encode the URL parameters differs from the application's default charset, the charset must be specified through a special parameter:
GET /user/login.htm?_input_charset=utf-8&name=%E5%90%8D%E5%AD%97&password=password HTTP/1.1
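How an application might honor such a parameter is not shown here, so the following is only a hypothetical sketch. It parses the raw query string itself, reads `_input_charset` first (its value is plain ASCII, so it can be read before any charset is chosen), and then decodes every parameter with that charset. The class name, the fallback charset argument, and the two-pass approach are all assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class InputCharsetQueryParser {
    static final String CHARSET_PARAM = "_input_charset";

    // Parses a raw (still percent-encoded) query string. The charset named by
    // _input_charset wins; otherwise defaultCharset is used.
    static Map<String, String> parse(String rawQuery, String defaultCharset) {
        String[] pairs = rawQuery.split("&");

        // First pass: look for the charset override.
        String charset = defaultCharset;
        for (String pair : pairs) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(CHARSET_PARAM)) {
                charset = pair.substring(eq + 1);
            }
        }

        // Second pass: decode every name and value with the chosen charset.
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : pairs) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name, charset), decode(value, charset));
        }
        return params;
    }

    private static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String query = "_input_charset=utf-8&name=%E5%90%8D%E5%AD%97&password=password";
        Map<String, String> params = parse(query, "GBK");
        System.out.println(params.get("name").equals("\u540D\u5B57")); // true
        System.out.println(params.get("password"));                     // password
    }
}
```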
However, the request with _input_charset shown above produces different results on different servlet engines. Why is that?
It turns out that, although request.setCharacterEncoding(charset) is called to set the input charset, the Servlet API specification says this setting applies only to the request content, not to the URL. In other words, request.setCharacterEncoding(charset) can only be used to parse the parameters of a POST request, not those of a GET request.
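The consequence can be illustrated outside a servlet container. In the sketch below, the same percent-encoded bytes are decoded once with ISO-8859-1 (assumed here as the container's fallback) and once with UTF-8. The `redecode` helper shows a commonly used recovery that is not part of the text above: because ISO-8859-1 maps every byte to exactly one character, the original bytes can be recovered and decoded again with the right charset. All names are illustrative.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetParameterDecoding {
    // URL-decodes a value, interpreting the percent-encoded bytes in the given charset.
    static String urlDecode(String encoded, Charset charset) {
        try {
            return URLDecoder.decode(encoded, charset.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Turns a wrongly decoded string back into bytes and decodes it with the actual charset.
    // Lossless only when the wrong charset maps every byte to a character, as ISO-8859-1 does.
    static String redecode(String misdecoded, Charset wrong, Charset actual) {
        return new String(misdecoded.getBytes(wrong), actual);
    }

    public static void main(String[] args) {
        String encoded = "%E5%90%8D%E5%AD%97";
        String expected = "\u540D\u5B57";

        String garbled = urlDecode(encoded, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());         // 6 (one character per byte)
        System.out.println(garbled.equals(expected)); // false

        String fixed = redecode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(fixed.equals(expected));   // true
    }
}
```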
So how should the parameters of a GET request be handled? According to the URL specification, non-US-ASCII characters in a URL must be URL-encoded based on UTF-8. In practice, however, neither browsers nor servers fully comply with this, which causes some confusion. The application servers we have encountered so far use several different decoding schemes:
How servers decode parameters
Server | Decoding logic |
Tomcat 4 | Decodes GET parameters based on the value set by request.setCharacterEncoding(charset); if the charset is not explicitly specified, the default is |