Http://blog.sina.com.cn/s/blog_87cb63e50102w2b6.html
*******************************
Basic concepts
Exchanging information involves three processes: encoding, transmission, and decoding. Encoding transforms information from one form into another, just as human language is encoded by the vocal cords into sound waves. Decoding is the inverse of encoding: the eardrum receives the sound waves, and the nervous system decodes them into information that a person can understand.
A character set is the collection of all text symbols used in a given cultural context. It specifies every character in that context and how each character is represented in an information exchange system; in a computer system, that representation is a sequence of bytes, that is, of 0s and 1s. To keep things simple, this article uses "character set" and "encoding scheme" interchangeably in places.
For Java web applications, encoding in the narrow sense can be understood simply: encoding turns a text string into a sequence of 0s and 1s, and decoding restores that sequence to the text string. Which sequence a given string becomes is determined by the character set, that is, by the encoding scheme.
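As a minimal sketch of this narrow sense of encoding and decoding, the following uses only the Java standard library. The class name EncodeDecodeSketch and the choice of UTF-8 and UTF-16BE as the two schemes are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeDecodeSketch {
    // Encode a text string into a byte sequence under the given encoding scheme.
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decode a byte sequence back into a text string under the given encoding scheme.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Two CJK characters, written as Unicode escapes so the source file stays ASCII.
        String text = "\u4e2d\u6587";

        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        byte[] utf16 = encode(text, StandardCharsets.UTF_16BE);

        // The same text becomes different byte sequences under different schemes.
        System.out.println(utf8.length);  // 6 (three bytes per character)
        System.out.println(utf16.length); // 4 (two bytes per character)

        // Decoding with the scheme that was used for encoding restores the text.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));     // true
        System.out.println(decode(utf16, StandardCharsets.UTF_16BE).equals(text)); // true
    }
}
```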
Garbled text results when the encoding scheme used for a piece of information is unknown and the wrong scheme is used to decode it. To recover the true meaning of a message you must know its encoding scheme; this is the key to exchanging information. It is also why, in wartime, so much effort went into cracking the other side's telegraph ciphers: breaking a cipher amounts to working out the other side's encoding scheme.
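Garbling is easy to reproduce. The sketch below encodes text as UTF-8 and then decodes the bytes as ISO-8859-1, which yields garbled characters. Because ISO-8859-1 maps every byte value to exactly one character, the mistake can be undone in this particular case. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class GarbledSketch {
    // Encode as UTF-8, then decode with the wrong scheme (ISO-8859-1).
    public static String garble(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the original bytes, then decode them as UTF-8.
    public static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two CJK characters
        String garbled = garble(text);

        // Each 3-byte UTF-8 character turns into three unrelated Latin-1 characters.
        System.out.println(garbled.length());             // 6
        System.out.println(repair(garbled).equals(text)); // true
    }
}
```

The repair only works because no bytes were lost along the way. Decoding with a scheme that rejects or replaces unknown byte sequences would destroy information for good.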
Encoding and decoding at the HTTP protocol layer
The character set at the HTTP protocol layer concerns which encoding scheme the HTTP sender uses to encode content and which scheme the receiver uses to decode it.
Browser-side encoding
On the requesting side, requests are mainly sent through forms, URLs, Ajax, and HTTP client components such as the HttpClient API.
A browser document has an encoding scheme, its charset. The document's encoding scheme is the same as the scheme used to decode the document, and it influences how requests issued from the document are encoded.
Factors that affect the encoding of submitted form data include the accept-charset attribute of the form and the encoding scheme of the HTML document, exposed as document.charset. Whether the form's accept-charset takes effect depends on the browser implementation; some browsers, such as IE, do not support it. In browsers that allow it, the document encoding scheme can be changed through document.charset.
URLs inside the document, such as the src URL of an iframe, are encoded according to the document encoding scheme. The encoding scheme of a URL typed into the address bar depends entirely on the browser implementation. When a request is sent through an HttpClient component, the URL can be encoded with any scheme the caller chooses.
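To illustrate why the sender's scheme matters, the sketch below percent-encodes the same text under two schemes with java.net.URLEncoder. Note that URLEncoder produces form-style encoding (a space becomes "+"), so it only approximates what a browser does to a URL. The class name UrlEncodingSketch is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlEncodingSketch {
    // Percent-encode text using the named encoding scheme.
    public static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e

        // The same character yields different percent-encoded bytes per scheme.
        System.out.println(encode(text, "UTF-8"));      // caf%C3%A9
        System.out.println(encode(text, "ISO-8859-1")); // caf%E9
    }
}
```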
For Ajax requests, the encoding of the URL depends entirely on the browser implementation and usually follows the document encoding scheme, but the data body is uniformly encoded as UTF-8. Although Ajax code can declare an encoding scheme in the Content-Type header, doing so has no effect on how the URL or the data body is actually encoded, and in some browsers the declared charset does not even survive into the Content-Type that is finally sent.
In addition, headers are encoded as ISO-8859-1, as laid down by the HTTP specification.
Server-side decoding
The HTTP server has to decode three things: the headers, the URL, and the data body.
The header decoding scheme is ISO-8859-1.
The URL decoding scheme is often called URIEncoding. The standard Servlet API provides no setting for it, but HTTP servers generally offer their own. Jetty decodes URLs as UTF-8 by default, while other HTTP servers, such as Tomcat before version 8, default to ISO-8859-1.
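The effect of a mismatched URL decoding scheme can be sketched with java.net.URLDecoder. This is only a model of what a server does internally; the class name UriDecodingSketch and the uriEncoding parameter are hypothetical stand-ins for a server's own setting.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UriDecodingSketch {
    // Decode a percent-encoded value using the server's configured scheme.
    public static String decode(String raw, String uriEncoding) {
        try {
            return URLDecoder.decode(raw, uriEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + uriEncoding, e);
        }
    }

    public static void main(String[] args) {
        // A client encoded "caf\u00e9" as UTF-8 before sending it.
        String raw = "caf%C3%A9";

        // Server configured for UTF-8: the text is restored.
        System.out.println(decode(raw, "UTF-8").equals("caf\u00e9")); // true

        // Server configured for ISO-8859-1: each UTF-8 byte becomes its own
        // character, so the single accented letter turns into two characters.
        System.out.println(decode(raw, "ISO-8859-1").length()); // 5
    }
}
```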
In a servlet, the data body decoding scheme can be set with request.setCharacterEncoding. Some HTTP servers pick the scheme for the data body in the following order of precedence: the value set through setCharacterEncoding > the charset in the request's Content-Type header > UTF-8.
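The order of precedence described above can be sketched as a small selection function. This is not the code of any particular server; the class BodyCharsetSketch, its method names, and the simplified Content-Type parsing are all assumptions made for illustration.

```java
import java.util.Locale;

public class BodyCharsetSketch {
    // Extract the charset parameter from a Content-Type value, or null if absent.
    public static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Precedence: explicitly set encoding > request header charset > UTF-8.
    public static String chooseBodyCharset(String explicitEncoding, String contentType) {
        if (explicitEncoding != null) {
            return explicitEncoding;
        }
        String fromHeader = charsetFromContentType(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        String header = "application/x-www-form-urlencoded; charset=GBK";
        System.out.println(chooseBodyCharset("ISO-8859-1", header));    // ISO-8859-1
        System.out.println(chooseBodyCharset(null, header));            // GBK
        System.out.println(chooseBodyCharset(null, "text/plain"));      // UTF-8
    }
}
```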
Server-side encoding
The HTTP server has to encode two things: the headers and the data body.
The encoding scheme for headers is again ISO-8859-1.
Typically the server must choose an encoding scheme for the returned data body and declare it in the headers. Otherwise the HTTP server usually falls back to ISO-8859-1 for the output, and the browser, not knowing the encoding scheme of the returned data body, has to guess, which depends entirely on the browser's own implementation.
response.setCharacterEncoding tells the HTTP server which encoding scheme to use for the data body. response.setContentType sets the Content-Type header, including its charset label, which the browser uses to choose a decoding scheme. In the standard Servlet API the two are linked: setCharacterEncoding also updates the charset label in Content-Type, and a charset passed to setContentType also sets the body encoding, with the later call overriding the earlier one. A sound HTTP server keeps the two consistent, so that the encoding the browser is told about matches the encoding actually used on the server. Note also that both methods must be called before the response obtains its writer (getWriter) or is committed; after that, the encoding scheme can no longer be changed.
Browser-side decoding
The browser has to decode two things in the response: the headers and the data body.
The header decoding scheme is ISO-8859-1.
The browser's decoding scheme for the data body depends on the returned information. The browser first looks for an encoding declaration in the response headers. If there is none and the content is HTML, it reads the meta tag in the document head. If that is also missing, the browser does not know how to decode the content and falls back to choosing a decoding scheme on its own.
An HTML document should declare its encoding in a meta tag, and the declaration must appear within the first 1024 bytes of the file, so it is best placed immediately after the opening head tag.
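The lookup order described above (response header first, then a meta declaration within the first 1024 bytes, then a fallback) can be sketched as follows. Real browsers implement a far more elaborate sniffing algorithm; the class BrowserCharsetSketch, the regular expressions, and the fallback parameter are simplifying assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BrowserCharsetSketch {
    // Matches charset=... in a Content-Type header value.
    private static final Pattern HEADER_CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Matches charset=... inside a <meta ...> tag (both the short and http-equiv forms).
    private static final Pattern META_CHARSET =
        Pattern.compile("<meta[^>]*charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)", Pattern.CASE_INSENSITIVE);

    public static String detect(String contentType, byte[] body, String fallback) {
        // 1. A charset label in the response header takes priority.
        if (contentType != null) {
            Matcher header = HEADER_CHARSET.matcher(contentType);
            if (header.find()) {
                return header.group(1);
            }
        }
        // 2. Otherwise scan only the first 1024 bytes for a meta declaration.
        //    ISO-8859-1 maps each byte to one char, so ASCII markup stays readable.
        int limit = Math.min(body.length, 1024);
        String prefix = new String(body, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher meta = META_CHARSET.matcher(prefix);
        if (meta.find()) {
            return meta.group(1);
        }
        // 3. Nothing declared: the browser has to pick something by itself.
        return fallback;
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"gbk\"></head></html>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("text/html; charset=UTF-8", html, "windows-1252")); // UTF-8
        System.out.println(detect("text/html", html, "windows-1252"));                // gbk
    }
}
```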
A document usually pulls in resource files through URLs, such as CSS and JS files. If the response for such a resource does not declare an explicit encoding scheme in its headers, the browser cannot know the encoding and decodes the resource with the document encoding scheme described above, which is the browser's best fault-tolerance strategy.
Encoding and decoding of browsers and services in Web apps