Why a Web project displays garbled text

Source: Internet
Author: User
Tags: locale
Charset encoding basics

Charset is short for character encoding, or character set encoding. A charset is an algorithm that converts characters into bytes, or bytes back into characters. Java represents characters internally as Unicode. Converting Unicode characters to bytes is called encoding, and restoring bytes to Unicode characters is called decoding.
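As a minimal sketch of the encoding and decoding steps described above, the following Java program encodes a short Chinese string with two charsets and decodes it again. The class name EncodeDecodeDemo and its helper methods are illustrative names, not part of the original article. The sample string is written with Unicode escapes so that the encoding of the source file itself does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Encoding: Unicode characters to bytes.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Encode with one charset, then decode with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u540D\u5B57"; // the two characters of the Chinese word for "name"
        Charset gbk = Charset.forName("GBK");
        Charset utf8 = StandardCharsets.UTF_8;

        System.out.println(encodedLength(text, utf8)); // 6
        System.out.println(encodedLength(text, gbk));  // 4

        // Decoding with the charset used for encoding restores the text.
        System.out.println(roundTrip(text, gbk, gbk).equals(text));  // true
        // Decoding with a different charset produces garbled text.
        System.out.println(roundTrip(text, gbk, utf8).equals(text)); // false
    }
}
```

The same characters occupy a different number of bytes in each charset, and the bytes are only meaningful to a decoder that uses the charset they were encoded with.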

The request parameters that the browser sends to the Web application arrive as a byte stream and must be decoded before the Java program can read them. The charset used to decode the request parameters is called the input charset.

The response content that the Web application returns to the browser must be encoded into a byte stream before the browser or client can interpret it. The charset used to encode the response content is called the output charset.

In general, the input charset and the output charset are the same, because the browser always encodes form data with the charset of the current page. For example, if a form page has the content type "text/html; charset=GBK", then when a user fills out the form and submits it, the browser encodes the data entered by the user in GBK. If the input charset differs from the output charset, the server cannot correctly decode the form data that the browser sends back to the Web application, because that data was encoded with the output charset.
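The effect of the page charset on submitted form data can be sketched with java.net.URLEncoder and java.net.URLDecoder, which implement the same application/x-www-form-urlencoded scheme that browsers use for forms. The class FormEncodingDemo and its helpers are illustrative names, and the sample value is an assumption chosen for the demonstration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingDemo {
    // What a browser would send for a field value on a page in the given charset.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // What the server recovers when it decodes with the given input charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String name = "\u540D\u5B57"; // sample form value (two Chinese characters)

        String fromGbkPage = encode(name, "GBK");
        String fromUtf8Page = encode(name, "UTF-8");
        System.out.println(fromGbkPage);  // %C3%FB%D7%D6
        System.out.println(fromUtf8Page); // %E5%90%8D%E5%AD%97

        // Matching input charset: the value is recovered.
        System.out.println(decode(fromGbkPage, "GBK").equals(name));   // true
        // Mismatched input charset: the value is garbled.
        System.out.println(decode(fromGbkPage, "UTF-8").equals(name)); // false
    }
}
```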

There are some exceptions, however, where the input and output charset may be different:

Forms that are sent through JavaScript are always encoded with UTF-8. This means that UTF-8 must be used as the input charset to decode the parameters correctly. As a result, the two charsets differ unless the output charset is also UTF-8.

When applications access each other over HTTP, they may use different encodings. For example, application A accesses application B using UTF-8, while application B uses GBK as its input/output charset. A parameter decoding error occurs in this case.

When a URL that contains parameters is typed directly into the browser's address bar, the result depends on the browser and operating system settings:

For example, on Chinese Windows, tests show that both IE and Firefox encode parameters with GBK by default. IE does not even URL-encode parameters that are typed in directly.

On Mac systems, tests show that both Safari and Firefox encode parameters with UTF-8 by default.

The relationship between locale and charset

Locale and charset are two relatively independent parameters, but they are related.

The locale determines the language of the text to be displayed, while the charset encodes text in that language into bytes or decodes bytes back into text. The charset must therefore cover the language that the locale represents; if it does not, garbled text may appear. The following table lists some combinations of locale and charset:

The relationship between locale and charset

                                                    English       Chinese character sets    Full character sets
Locale                                              ISO-8859-1    GB2312    Big5    GBK     GB18030    UTF-8
en_US (American English)                            yes           yes       yes     yes     yes        yes
zh_CN (Simplified Chinese)                          no            yes       no      yes     yes        yes
zh_TW, zh_HK (Taiwanese Chinese, Hong Kong Chinese) no            no        yes     yes     yes        yes
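Whether a charset covers a given piece of text can also be checked in code. The sketch below uses CharsetEncoder.canEncode from java.nio.charset; the class name CharsetCoverageDemo, the helper covers, and the sample strings are assumptions made for illustration, and the Chinese charsets are assumed to be installed in the running JDK.

```java
import java.nio.charset.Charset;

class CharsetCoverageDemo {
    // True if every character of the text can be encoded in the named charset.
    static boolean covers(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String english = "hello";
        String simplified = "\u6C49\u5B57";  // "Chinese characters" in Simplified Chinese
        String traditional = "\u6F22\u5B57"; // the same word in Traditional Chinese

        String[] charsets = {"ISO-8859-1", "GB2312", "Big5", "GBK", "GB18030", "UTF-8"};
        for (String cs : charsets) {
            System.out.println(cs
                    + " english=" + covers(cs, english)
                    + " simplified=" + covers(cs, simplified)
                    + " traditional=" + covers(cs, traditional));
        }
    }
}
```

A check like this on representative text is one way to confirm that a chosen output charset suits the locale before pages are served with it.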

Among all charsets, a few are "universal" encodings:

UTF-8

Covers all the characters in Unicode. However, when UTF-8 is used to encode a predominantly Chinese page, each Chinese character takes up 3 bytes. UTF-8 is recommended for pages that are not predominantly Chinese.

GB18030

The Chinese national standard. Like UTF-8, it covers all the characters in Unicode. Encoding predominantly Chinese pages with GB18030 has an advantage: most common Chinese characters take only 2 bytes, one third shorter than in UTF-8. However, GB18030 may not be recognized on non-Chinese operating systems, so it is less portable than UTF-8. GB18030 is therefore recommended only for predominantly Chinese pages.

GBK

Strictly speaking, GBK is not a universal encoding (for example, it does not support many Western European characters), nor is it an international standard. However, the number of characters it supports is close to that of GB18030.

Setting the locale and charset

In the Servlet API, the following methods are related to locale and charset.

Servlet APIs related to locale and charset

HttpServletRequest

.getCharacterEncoding()           Reads the input encoding.
.setCharacterEncoding(charset)    Sets the input encoding.
                                  Must be called before the first call to request.getParameter() or request.getParameterMap(); otherwise it has no effect.
                                  If it is not set, parameters are decoded with ISO-8859-1 by default.
                                  Generally affects only the decoding of POST request parameters.
.getLocale()                      Gets the browser's preferred locale from Accept-Language.
.getLocales()                     Gets all the locales specified in Accept-Language.

HttpServletResponse

.getCharacterEncoding()           Gets the output encoding.
.setCharacterEncoding(charset)    Sets the output encoding. Since Servlet 2.4.
.getContentType()                 Gets the content type. Since Servlet 2.4.
.setContentType(contentType)      Sets the content type. The content type may contain a charset definition, such as: text/html; charset=GBK
.getLocale()                      Gets the output locale.
.setLocale(locale)                Sets the output locale.
                                  Must be called before the response is committed; otherwise it has no effect.
                                  Also sets the charset, unless the content type has already been set and contains a charset definition.
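The ordering rule for request.setCharacterEncoding(charset) is easier to see in a small simulation. The sketch below does not use the real Servlet API: FakeRequest is a hypothetical stand-in that parses its parameters lazily on the first read, which is assumed here as a simplified model of how a servlet container behaves. The ISO-8859-1 default follows the table above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

class EncodingOrderDemo {
    // Hypothetical stand-in for a servlet request; not part of the Servlet API.
    static class FakeRequest {
        private final String body;          // URL-encoded POST body
        private String encoding;            // null until explicitly set
        private Map<String, String> params; // parsed once, on first read

        FakeRequest(String body) {
            this.body = body;
        }

        // Ignored once the parameters have been parsed.
        void setCharacterEncoding(String charset) {
            if (params == null) {
                encoding = charset;
            }
        }

        String getParameter(String name) {
            if (params == null) {
                params = parse();
            }
            return params.get(name);
        }

        private Map<String, String> parse() {
            String charset = (encoding != null) ? encoding : "ISO-8859-1";
            Map<String, String> result = new HashMap<>();
            for (String pair : body.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    continue;
                }
                try {
                    result.put(URLDecoder.decode(pair.substring(0, eq), charset),
                               URLDecoder.decode(pair.substring(eq + 1), charset));
                } catch (UnsupportedEncodingException e) {
                    throw new IllegalArgumentException("Unknown charset: " + charset, e);
                }
            }
            return result;
        }
    }

    // Reads "name" from a UTF-8 encoded body, setting the encoding early or too late.
    static String readName(boolean setBeforeFirstRead) {
        FakeRequest request = new FakeRequest("name=%E5%90%8D%E5%AD%97&password=password");
        if (setBeforeFirstRead) {
            request.setCharacterEncoding("UTF-8");
        }
        request.getParameter("password");      // first read triggers parsing
        request.setCharacterEncoding("UTF-8"); // no effect after parsing
        return request.getParameter("name");
    }

    public static void main(String[] args) {
        System.out.println(readName(true).equals("\u540D\u5B57"));  // true
        System.out.println(readName(false).equals("\u540D\u5B57")); // false
    }
}
```

In a real servlet or filter, the corresponding step is to call request.setCharacterEncoding(charset) before any code reads a parameter.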

Setting the locale and charset looks easy, but it is not so easy to get right:

The input encoding must be set before the first call that reads a request parameter; otherwise it has no effect.

In Servlet 2.3 and earlier, the only way to set the output encoding was to set a content type that includes a charset definition. Servlet 2.4 improved this by adding a separate method for setting the output encoding.

Parsing the parameters of a GET request

A GET request is the simplest kind of request. Its parameters are included in the URL in URL-encoded form. When you type "http://localhost:8081/user/login.htm?name=%E5%90%8D%E5%AD%97&password=password" in the browser's address bar, the browser sends the following HTTP request to the server at localhost:8081:

GET /user/login.htm?name=%E5%90%8D%E5%AD%97&password=password HTTP/1.1
Host: localhost:8081

The parameters in a GET request are encoded as application/x-www-form-urlencoded with a specific charset. If the charset used to encode the URL parameters differs from the application's default charset, the charset must be specified through a special parameter:

GET /user/login.htm?_input_charset=utf-8&name=%E5%90%8D%E5%AD%97&password=password HTTP/1.1
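One way an application could honor the _input_charset parameter is to scan the raw query string for it first and only then decode the remaining parameters. The following is a sketch under that assumption, not the implementation used by any particular framework; the class InputCharsetDemo and the method parseQuery are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class InputCharsetDemo {
    static final String CHARSET_PARAM = "_input_charset";

    // Decodes a raw query string, letting _input_charset override the default.
    static Map<String, String> parseQuery(String rawQuery, String defaultCharset) {
        String[] pairs = rawQuery.split("&");

        // First pass: charset names are plain ASCII, so the special parameter
        // can be read before anything has been decoded.
        String charset = defaultCharset;
        for (String pair : pairs) {
            if (pair.startsWith(CHARSET_PARAM + "=")) {
                charset = pair.substring(CHARSET_PARAM.length() + 1);
            }
        }

        // Second pass: decode every name and value with the chosen charset.
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : pairs) {
            int eq = pair.indexOf('=');
            String name = (eq < 0) ? pair : pair.substring(0, eq);
            String value = (eq < 0) ? "" : pair.substring(eq + 1);
            params.put(decode(name, charset), decode(value, charset));
        }
        return params;
    }

    static String decode(String text, String charset) {
        try {
            return URLDecoder.decode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String query = "_input_charset=utf-8&name=%E5%90%8D%E5%AD%97&password=password";
        Map<String, String> params = parseQuery(query, "GBK");
        System.out.println("\u540D\u5B57".equals(params.get("name"))); // true
        System.out.println(params.get("password"));                    // password
    }
}
```

Parsing the raw query string in application code bypasses whatever charset the servlet engine would apply to the URL on its own.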

However, different servlet engines handle the request with the _input_charset parameter differently, so the result is unpredictable. Why is that?

It turns out that, although the request.setCharacterEncoding(charset) method is called to set the input charset, according to the Servlet API specification this setting applies only to the request content (the body), not to the URL. In other words, request.setCharacterEncoding(charset) can only be used to parse the parameters of a POST request, not those of a GET request.
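When a container has already decoded a GET parameter with ISO-8859-1, a commonly used workaround (not one prescribed by the original article) is to turn the string back into its original bytes and decode them again with the charset the browser actually used. This works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost. The class RedecodeDemo and the method redecode are hypothetical names for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RedecodeDemo {
    // Re-interprets a string that was wrongly decoded as ISO-8859-1.
    static String redecode(String wronglyDecoded, Charset actualCharset) {
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actualCharset);
    }

    public static void main(String[] args) {
        String name = "\u540D\u5B57"; // value the user typed

        // The browser sends UTF-8 bytes; the container decodes them as ISO-8859-1.
        byte[] sent = name.getBytes(StandardCharsets.UTF_8);
        String seenByApplication = new String(sent, StandardCharsets.ISO_8859_1);
        System.out.println(seenByApplication.equals(name)); // false

        // Recover the original value by reversing the wrong decoding step.
        String recovered = redecode(seenByApplication, StandardCharsets.UTF_8);
        System.out.println(recovered.equals(name)); // true
    }
}
```

The workaround only applies when the wrong decoding step really was ISO-8859-1; if the container used a different charset, applying it would corrupt the value instead.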

So how should the parameters of a GET request be handled? According to the URL specification, non-US-ASCII characters in a URL must be URL-encoded based on UTF-8. In practice, however, neither browsers nor servers fully comply with this specification, which causes some confusion. Among the application servers we have encountered, there are several different decoding schemes:

How the server decodes parameters

Server      Decoding logic
Tomcat 4    Decodes GET parameters based on the value set by request.setCharacterEncoding(charset).
            If no charset is specified, the default is
