I. Encoding and decoding in Java Web applications
As we all know, encoding issues with Chinese text arise wherever I/O is involved. As mentioned earlier, any I/O operation can introduce encoding, and most garbled text produced by I/O comes from network I/O, because almost every application now involves network operations and data travels over the network as bytes. All data must therefore be serializable into bytes. In Java, a class whose objects are to be serialized must implement the Serializable interface.
Here is a question: have you ever seriously considered how the actual size of a piece of text should be calculated? I once ran into this problem while trying to shrink cookies to reduce network traffic. One candidate compression algorithm reduced the number of characters but not the number of bytes. The so-called compression merely packed several single-byte characters into one multibyte character, so String.length() went down while the final byte count did not. As another example, take the integer 1234567. Stored as characters, it occupies 7 bytes in UTF-8 and 14 bytes in UTF-16, but stored as an int it needs only 4 bytes. So the character count alone tells you little about the size of a piece of text: the same characters take different amounts of storage under different encodings, and going from characters to bytes always requires knowing the encoding.
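The byte counts quoted above can be checked with a minimal sketch like the following. The class and method names (TextSizeDemo, encodedSize, intSize) are made up for illustration, and UTF-16BE is chosen on the assumption that the 14-byte figure excludes a byte-order mark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TextSizeDemo {
    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedSize(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Number of bytes needed to store the value as a binary int.
    static int intSize(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array().length;
    }

    public static void main(String[] args) {
        String digits = "1234567";
        System.out.println(encodedSize(digits, StandardCharsets.UTF_8));    // 7
        // UTF_16BE is used so that no byte-order mark is counted.
        System.out.println(encodedSize(digits, StandardCharsets.UTF_16BE)); // 14
        System.out.println(intSize(1234567));                               // 4

        // Two CJK characters: length() is 2, but UTF-8 needs 3 bytes each.
        String packed = "\u6dd8\u5b9d";
        System.out.println(packed.length());                                // 2
        System.out.println(encodedSize(packed, StandardCharsets.UTF_8));    // 6
    }
}
```

The last two lines mirror the cookie case: fewer characters does not mean fewer bytes on the wire.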
Another question: when we type a character into a text editor, how is it represented inside the computer? We all know that everything in a computer is made of 0s and 1s, so exactly how many 0s and 1s make up a Chinese character?
What we see on screen is the character's visual form. In Java, for example, the two characters of "Taobao" (淘宝) have the numeric values 28120 and 23453 in decimal, or 6dd8 and 5b9d in hexadecimal; that is, each character is represented by just one of these numbers. A char in Java is 16 bits, or two bytes, so the two characters stored as char values take up 4 bytes in memory.
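A short sketch can confirm these numeric values. The class and helper names are hypothetical, and the two characters are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
class CharValueDemo {
    // Decimal and hexadecimal value of a single char, e.g. "28120 0x6dd8".
    static String describe(char c) {
        return (int) c + " 0x" + Integer.toHexString(c);
    }

    // Bytes used when each character of the string is held as a 16-bit char.
    static int charStorageBytes(String s) {
        return s.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        String taobao = "\u6dd8\u5b9d"; // the two characters of "Taobao"
        for (char c : taobao.toCharArray()) {
            System.out.println(describe(c));
        }
        System.out.println(charStorageBytes(taobao)); // 4
    }
}
```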
Now to the main topic: let's look at where encoding and decoding can occur in a Java Web application.
The user initiates an HTTP request from the browser, and the URL, cookies, and parameters all need to be encoded there. When the server receives the HTTP request, it parses the HTTP protocol, and the URL, cookies, and POST form parameters need to be decoded. The server may also need to read data from a database, from local files, or from elsewhere on the network, and each of these reads can have encoding problems. When the servlet has finished processing the request, it must encode the response data and send it through the socket to the browser that made the request, and the browser then decodes it into text. The process is shown in Figure 3-1.
3.1 URL encoding and decoding
A URL submitted by the user may contain Chinese characters, so it has to be encoded and later decoded. By what rules is it encoded, and how is it decoded? Figure 3-2 shows the components of a URL.
Take Tomcat as the servlet engine and map these components to its configuration files. The port corresponds to <Connector port="8080"/> in Tomcat, the context path is configured in <Context path="/examples"/>, and the servlet path is configured in <url-pattern> in the web application's web.xml. PathInfo is the specific servlet we request, and QueryString holds the parameters to pass. Note that a URL entered directly in the browser is requested with the GET method; if the request uses the POST method, the QueryString is submitted to the server through the form.
The PathInfo and QueryString parts may contain Chinese characters. When we enter such a URL directly in the browser, how do the browser and the server encode and parse it? To check how the browser encodes the URL, we use Google Chrome to view the actual request URL. The result is:
Request URL: https://localhost:8080/examples/servlets/servlet/%e5%8f%af%e4%b9%90?author=%e5%8f%af%e4%b9%90
The word 可乐 ("cola") is encoded as %e5%8f%af%e4%b9%90 (different browsers may use different encodings). Why the % signs? When the browser encodes a URL, it converts each non-ASCII character to bytes in some encoding, writes each byte as a two-digit hexadecimal number, and puts a % in front of it, which produces the URL above.
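The percent-encoding rule just described can be sketched in a few lines. This is a simplified illustration, not the browser's actual implementation: the percentEncode helper is hypothetical, it assumes UTF-8 as the browser's encoding, and it leaves only ASCII letters, digits, and the characters - . _ ~ unescaped.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentEncodeDemo {
    // Encode the text to bytes, then write every byte that is not a plain
    // ASCII letter, digit, or one of - . _ ~ as '%' plus two hex digits.
    static String percentEncode(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean plain = v < 0x80
                    && (Character.isLetterOrDigit(v) || "-._~".indexOf(v) >= 0);
            if (plain) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02x", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two-character word for "cola", written as Unicode escapes.
        String cola = "\u53ef\u4e50";
        System.out.println(percentEncode(cola, StandardCharsets.UTF_8));
        System.out.println(percentEncode("servlet", StandardCharsets.UTF_8));
    }
}
```

With UTF-8 each of the two characters becomes three bytes, which is why the encoded form contains six %xx groups.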
Tomcat decodes the URI part of the URL with the character set defined on the Connector, <Connector URIEncoding="UTF-8"/>. If it is not defined, the URI is parsed with the default encoding, ISO-8859-1 (this is the default in Tomcat 7 and earlier; since Tomcat 8 the default is UTF-8). So if URLs contain Chinese characters, it is best to set URIEncoding to UTF-8.
How is the QueryString parsed? The QueryString of a GET request and the form parameters of a POST request are both stored as parameters, and their values are obtained through request.getParameter. They are decoded the first time request.getParameter is called: that call invokes the parseParameters method of org.apache.catalina.connector.Request, which decodes the parameters passed by both GET and POST, though the two may be decoded with different character sets. For a GET request, the QueryString is decoded either with the charset defined by ContentType in the header or with the default, ISO-8859-1. To use the encoding defined in ContentType, set useBodyEncodingForURI to true on the Connector: <Connector URIEncoding="UTF-8" useBodyEncodingForURI="true"/>. This setting applies only to the QueryString.
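To see why the decoding character set matters, the sketch below decodes the same percent-encoded value once as ISO-8859-1 and once as UTF-8. It uses java.net.URLDecoder as a stand-in for the container's decoding, which is an assumption made for illustration; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class QueryDecodeDemo {
    // Decode a percent-encoded value with the named charset.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Undo a wrong ISO-8859-1 decode: recover the original bytes, then
    // reinterpret them as UTF-8. This works because ISO-8859-1 maps every
    // byte value to exactly one char.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "%e5%8f%af%e4%b9%90";
        String wrong = decode(raw, "ISO-8859-1"); // one char per byte
        String right = decode(raw, "UTF-8");      // the intended two characters
        System.out.println(wrong.length());               // 6
        System.out.println(right.length());               // 2
        System.out.println(repair(wrong).equals(right));  // true
    }
}
```

The repair step is a common workaround when the container's setting cannot be changed, but configuring the right decoding charset up front is the cleaner fix.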
As the above shows, URL encoding and decoding is fairly complex, and it is not something our application can fully control. Applications should therefore avoid non-ASCII characters in URLs where possible; otherwise garbled text is likely.
3.2 HTTP header encoding and decoding
When a client initiates an HTTP request, it may pass other parameters in the header besides the URL above, such as Cookie and redirectPath. These user-set values can also have encoding problems. How does Tomcat decode them?
Header items are likewise decoded when request.getHeader is called. If the requested header item has not yet been decoded, the toString method of MessageBytes is called, and the default encoding this method uses for the byte-to-char conversion is also ISO-8859-1. We cannot configure any other decoding for headers, so a header that contains non-ASCII characters will certainly come out garbled. The same applies when we add headers: do not pass non-ASCII characters in a header. If you must, first encode the characters with org.apache.catalina.util.URLEncoder and then add them to the header, so that no information is lost in transit between the browser and the server. When we need to read these items, we then decode them with the corresponding character set.
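The encode-before-adding, decode-after-reading approach for header values might look like the sketch below. It swaps in the JDK's java.net.URLEncoder and URLDecoder for the Tomcat class named above so that it runs without a servlet container; the helper names are hypothetical and UTF-8 is an assumed choice of character set.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class HeaderValueDemo {
    // Turn an arbitrary string into an ASCII-only form that is safe to put
    // in a header value.
    static String toHeaderValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverse the encoding after the header has been read on the other side.
    static String fromHeaderValue(String headerValue) {
        try {
            return URLDecoder.decode(headerValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u53ef\u4e50"; // the two-character word for "cola"
        String encoded = toHeaderValue(original);
        System.out.println(encoded); // %E5%8F%AF%E4%B9%90
        System.out.println(fromHeaderValue(encoded).equals(original)); // true
    }
}
```

Because the encoded value contains only ASCII characters, it survives the ISO-8859-1 byte-to-char conversion unchanged.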
3.3 POST form encoding and decoding
As mentioned earlier, the parameters submitted by a POST form are decoded on the first call to request.getParameter. POST form parameters are passed to the server in the HTTP body. When we click the submit button on a page, the browser first encodes the form values with the charset given in ContentType and then submits them to the server, and the server decodes them with the same ContentType charset. Parameters submitted through a POST form therefore generally cause no problems. We can also set this character set ourselves through request.setCharacterEncoding(charset).
Note that request.setCharacterEncoding(charset) must be called before the first call to request.getParameter; otherwise the data submitted by the POST form may come out garbled.
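Why the order of the two calls matters can be modeled with a small stand-in class. LazyRequest below is hypothetical and is not Tomcat's Request class. It only imitates the behavior described above: parameters are parsed once, on the first getParameter call, using whatever encoding is set at that moment (ISO-8859-1 is assumed as the default).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

class PostEncodingDemo {
    // Hypothetical model of a request that parses its body lazily.
    static class LazyRequest {
        private final String rawBody;
        private String encoding = "ISO-8859-1"; // assumed default
        private Map<String, String> params;     // null until first parse

        LazyRequest(String rawBody) {
            this.rawBody = rawBody;
        }

        void setCharacterEncoding(String charsetName) {
            this.encoding = charsetName;
        }

        String getParameter(String name) {
            if (params == null) {
                params = parse(rawBody, encoding); // happens only once
            }
            return params.get(name);
        }

        private static Map<String, String> parse(String body, String charsetName) {
            Map<String, String> result = new HashMap<>();
            try {
                for (String pair : body.split("&")) {
                    int eq = pair.indexOf('=');
                    if (eq < 0) {
                        continue; // skip fragments without a value
                    }
                    result.put(URLDecoder.decode(pair.substring(0, eq), charsetName),
                               URLDecoder.decode(pair.substring(eq + 1), charsetName));
                }
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException(e);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        String body = "author=%E5%8F%AF%E4%B9%90";

        LazyRequest early = new LazyRequest(body);
        early.setCharacterEncoding("UTF-8");          // set before first read
        System.out.println(early.getParameter("author").length()); // 2

        LazyRequest late = new LazyRequest(body);
        late.getParameter("author");                  // parsed as ISO-8859-1
        late.setCharacterEncoding("UTF-8");           // too late, already parsed
        System.out.println(late.getParameter("author").length());  // 6
    }
}
```

In a real application the usual place for the setCharacterEncoding call is a filter that runs before any servlet reads parameters.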
The article draws on "In-depth analysis of Java Web Insider"