Analyzing Character Encoding Problems in Web Development

Source: Internet
Author: User

url:http://tcking.javaeye.com/blog/726643

Garbled text is a common problem in web development. It usually shows up in one of three places:

1. Chinese text in a JSP file becomes garbled

2. Chinese text on the rendered page becomes garbled

3. The value returned by request.getParameter() on the server side is garbled

Encoding basics

A computer can only store and transmit information as bytes, while people need to see strings. The mapping between bytes and strings is a character set. For example, the character "中" maps to the three bytes E4 B8 AD under UTF-8, and conversely those three bytes map back to "中" under UTF-8. Different character sets use different mapping rules and do not necessarily cover the same range of characters; in GB2312, for example, "中" is represented by the two bytes D6 D0. The conversion between characters and bytes is called encoding and decoding:

• Character to bytes: encoding. For example, "中" is encoded in UTF-8 as E4 B8 AD

• Bytes to character: decoding. For example, the byte array D6 D0 is decoded with GB2312 to the character "中"

There is another kind of encoding called URI encoding and URI decoding. URI encoding is not a conversion between strings and byte streams; it uses one string to represent another string. For example:

• "中" is URI-encoded with UTF-8 as %E4%B8%AD

• The string %E4%B8%AD is URI-decoded with UTF-8 to the character "中"

As you can see, URI encoding represents a string by writing "%" followed by the hexadecimal value of each byte of the string's encoding in the chosen character set. In Java, the String class has two commonly used members for encoding and decoding:

• getBytes: for example "中".getBytes("character set") encodes the string using the specified character set

• new String(byte[], "character set"): decodes a byte array using the specified character set
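The two String members above can be tried out in a few lines. The following is a minimal sketch, not part of the original article: the class name EncodeDecodeDemo and the helper hex are made up for illustration, and it assumes the Java runtime ships the GBK charset (typical JDK distributions do). The character "中" is written as the escape \u4e2d so the source file stays ASCII-only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Format bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D, the example character from the text
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JRE

        // Encoding: string -> bytes
        byte[] utf8Bytes = zhong.getBytes(StandardCharsets.UTF_8);
        byte[] gbkBytes = zhong.getBytes(gbk);
        System.out.println(hex(utf8Bytes)); // E4 B8 AD
        System.out.println(hex(gbkBytes));  // D6 D0

        // Decoding: bytes -> string, using the same charset that produced the bytes
        String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
        String fromGbk = new String(gbkBytes, gbk);
        System.out.println(fromUtf8.equals(zhong) && fromGbk.equals(zhong)); // true
    }
}
```

The sketch uses the Charset overloads of getBytes and the String constructor; the overloads that take a charset name as a string, as written in the list above, behave the same way but throw a checked UnsupportedEncodingException.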

Why garbled text occurs

The browser and the server are connected through the network. The browser's request is encoded into a byte stream for transmission, and the application server decodes the received byte stream back into strings using some character set. If the browser and the server use different or incompatible character sets, the text becomes garbled. For example, the browser encodes "中" with UTF-8 as the bytes E4 B8 AD and sends them over the network; if the application server decodes the received bytes with GBK, the first two bytes are decoded as "涓". This is a simplified description; the actual process is more complicated.
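The mismatch described above can be reproduced without a browser or a server. This is a small sketch under stated assumptions: the class MismatchDemo and the method roundTrip are hypothetical names, the "browser" and "server" are just the encode and decode halves of one method, and the runtime is assumed to provide the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchDemo {
    // Encode with the sender's charset, then decode with the receiver's charset.
    static String roundTrip(String text, Charset senderCharset, Charset receiverCharset) {
        byte[] onTheWire = text.getBytes(senderCharset);
        return new String(onTheWire, receiverCharset);
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JRE

        // Same charset on both sides: the text survives.
        String ok = roundTrip(zhong, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(zhong)); // true

        // UTF-8 bytes decoded as GBK: the text is garbled.
        String garbled = roundTrip(zhong, StandardCharsets.UTF_8, gbk);
        System.out.println(garbled.equals(zhong)); // false
    }
}
```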

It is therefore necessary to understand where encoding and decoding happen in a web request. The web works in a request/response mode. The user operates the browser, for example by clicking a button to submit a form or by clicking a hyperlink, and the browser sends a request to the application server. After accepting the request, the servlet container calls the appropriate application code according to the settings in web.xml. The application processes the request and returns a piece of HTML to the browser, which parses the HTML and presents it to the user. That is one complete request/response cycle. The following sections describe each step in more detail:

1. The browser sends the request to the application server

The browser sends requests to the application server in three ways: 1. submitting a form, 2. following a hyperlink, 3. Ajax.

1) Form submission

A form can be submitted with either POST or GET.

With POST, the browser encodes the strings in the form into a byte stream using the page's character set and sends it to the server.

With GET, the browser first URI-encodes the form values using the page's character set, appends them to the action URL, and then sends the request to the application server. For example:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>test</title>
<body>
<form action="http://www.google.com">
<input type="text" name="test" value="中"/>
<input type="submit"/>
</form>
</body>

After you click Submit, the URL becomes www.google.com.hk/?test=%E4%B8%AD. The character "中" has been encoded as %E4%B8%AD: because the character set of the current page is UTF-8, the GET form submission is URI-encoded with UTF-8.

2) Hyperlinks

Hyperlinks often carry parameters, and sometimes those parameters contain Chinese, as in this code: <a href="http://www.google.com/?test=中">link</a>. What the browser finally sends to the server must be ASCII characters. "中" is not in the ASCII character set, so it is URI-encoded as well, but different browsers use different character sets to do so. Take the hyperlink above. On Windows 7, whether the page declares content="text/html; charset=UTF-8" or content="text/html; charset=GBK", IE8 sends www.google.com.hk/?test=%D6%D0, which is the GBK URI encoding: IE8 there encodes hyperlinks with the system default encoding, independent of the page encoding. On Windows XP, IE8 URI-encodes with the character set of the page. IE6 sends the correct GBK encoding when the page is GBK, but when the page is UTF-8 it sends only the first two bytes of the UTF-8 URI encoding. Firefox and Chrome URI-encode with the character set of the page.

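Operating system | Browser                           | Page encoding | Request string sent | Description
-----------------|-----------------------------------|---------------|---------------------|------------------------------------------------------------------
Windows 7        | IE8 (Chinese)                     | UTF-8         | test=%D6%D0         | On Windows 7, URI encoding uses GBK regardless of the page encoding
Windows 7        | IE8 (Chinese)                     | GBK           | test=%D6%D0         | On Windows 7, URI encoding uses GBK regardless of the page encoding
Windows XP       | IE8 (Chinese)                     | UTF-8         | test=%E4%B8%AD      | On XP, URI encoding uses the page's character set
Windows XP       | IE8 (Chinese)                     | GBK           | test=%D6%D0         | On XP, URI encoding uses the page's character set
Windows 2003     | IE6 (Chinese)                     | GBK           | test=%D6%D0         | GBK is correct, UTF-8 is not
Windows 2003     | IE6 (Chinese)                     | UTF-8         | test=%E4%B8         | GBK is correct, UTF-8 is not
--               | Chrome (Chinese), Firefox (English) | UTF-8       | test=%E4%B8%AD      | URI encoding uses the page's character set
--               | Chrome (Chinese), Firefox (English) | GBK         | test=%D6%D0         | URI encoding uses the page's character set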

As the table shows, when Chinese is written directly in a URL, different versions of IE on different operating systems may URI-encode it differently, while Chrome and Firefox encode it the same way as a GET form submission. Writing non-ASCII characters directly in a link is therefore risky, because how they are encoded depends on the client's environment. To keep the browser from applying an unpredictable URI encoding, the program should URI-encode any Chinese in the URL itself. JavaScript provides the encodeURI() function, which URI-encodes with UTF-8; in Java the same can be done with java.net.URLEncoder.encode(str, "character set").
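As a rough illustration of encoding the parameter in the program rather than leaving it to the browser, here is a minimal sketch using java.net.URLEncoder. The class QueryParamDemo and the method link are hypothetical names, and the GBK line assumes that charset is available in the runtime. Note that URLEncoder produces form-style encoding, so a space becomes "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryParamDemo {
    // Build a link whose parameter is percent-encoded by the program with a known charset.
    static String link(String baseUrl, String name, String value, String charsetName) {
        try {
            return baseUrl + "?" + URLEncoder.encode(name, charsetName)
                    + "=" + URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Rethrow unchecked so callers of this sketch need no throws clause.
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D
        System.out.println(link("http://www.google.com/", "test", zhong, "UTF-8"));
        // http://www.google.com/?test=%E4%B8%AD
        System.out.println(link("http://www.google.com/", "test", zhong, "GBK"));
        // http://www.google.com/?test=%D6%D0
    }
}
```

Whichever character set is chosen here has to match the one the server uses to decode the query string.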

3) Ajax

An Ajax request can use either GET or POST, and in each case the behavior is the same as described above.

2. The application server reads the parameters

In a servlet, the parameters sent by the browser are normally read with request.getParameter(). Note that what the servlet receives underneath is an InputStream, that is, a byte stream, while request.getParameter() returns a string. There is therefore a decoding step inside getParameter(), and the character set used for decoding may differ depending on the application server and operating system.

The ServletRequest interface provides the method setCharacterEncoding() to set the character set that getParameter uses for decoding, and it must be called before getParameter. The Tomcat source code shows why: on its first call, getParameter initializes a Map that stores the parameter names and values, decoded with the character set in effect at that moment. Once the parameters have been decoded, later calls take the values directly from the Map without decoding again, so setCharacterEncoding only has an effect if it is called before the first getParameter.

Some people say that this method only works for parameters passed by POST and has no effect on parameters passed by GET. That is true for Tomcat 5, but on WebSphere and Apusic setCharacterEncoding works for both POST and GET.
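The "decode once, then cache" behavior can be modeled with a toy class. The sketch below is not Tomcat's code and does not use the Servlet API: ToyRequest is a made-up stand-in that parses a raw query string lazily, and its ISO-8859-1 default is an assumption chosen for the demonstration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a servlet request (hypothetical, for illustration only).
class ToyRequest {
    private final String rawQuery;          // e.g. "test=%E4%B8%AD"
    private String encoding = "ISO-8859-1"; // default assumed for this sketch
    private Map<String, String> parsed;     // filled on the first getParameter call

    ToyRequest(String rawQuery) {
        this.rawQuery = rawQuery;
    }

    void setCharacterEncoding(String encoding) {
        this.encoding = encoding;
    }

    String getParameter(String name) {
        if (parsed == null) { // decode once, then reuse the map on later calls
            parsed = new HashMap<>();
            for (String pair : rawQuery.split("&")) {
                String[] kv = pair.split("=", 2);
                parsed.put(decode(kv[0]), decode(kv.length > 1 ? kv[1] : ""));
            }
        }
        return parsed.get(name);
    }

    private String decode(String s) {
        try {
            return URLDecoder.decode(s, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        ToyRequest early = new ToyRequest("test=%E4%B8%AD");
        early.setCharacterEncoding("UTF-8"); // set before the first read
        System.out.println(early.getParameter("test").equals("\u4e2d")); // true

        ToyRequest late = new ToyRequest("test=%E4%B8%AD");
        String first = late.getParameter("test"); // decoded with the default
        late.setCharacterEncoding("UTF-8");       // too late: the map is already built
        System.out.println(late.getParameter("test").equals(first)); // true
    }
}
```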

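Application server | Server OS default encoding | Page encoding | Submit method | URI encoding | setCharacterEncoding | getParameter result | Note
-------------------|----------------------------|---------------|---------------|--------------|----------------------|---------------------|-----------------------------------------------
WebSphere 6.1      | GBK                        | UTF-8         | POST          | -            | UTF-8                | Correct             | Server default configuration
WebSphere 6.1      | GBK                        | UTF-8         | POST          | -            | GBK                  | Error               | Server default configuration
WebSphere 6.1      | GBK                        | UTF-8         | GET           | UTF-8        | UTF-8                | Correct             | Server default configuration
WebSphere 6.1      | GBK                        | UTF-8         | Hyperlink     | GBK          | GBK                  | Correct             | Server default configuration
Tomcat 5.5         | GBK                        | UTF-8         | POST          | -            | UTF-8                | Correct             | URIEncoding and useBodyEncodingForURI not set
Tomcat 5.5         | GBK                        | UTF-8         | POST          | -            | GBK                  | Error               | URIEncoding and useBodyEncodingForURI not set
Tomcat 5.5         | GBK                        | UTF-8         | GET           | UTF-8        | UTF-8                | Error               | URIEncoding and useBodyEncodingForURI not set
Tomcat 5.5         | GBK                        | UTF-8         | Hyperlink     | GBK          | GBK                  | Error               | URIEncoding and useBodyEncodingForURI not set
Apusic 5.1         | GBK                        | UTF-8         | POST          | -            | UTF-8                | Correct             | Server default configuration
Apusic 5.1         | GBK                        | UTF-8         | POST          | -            | GBK                  | Error               | Server default configuration
Apusic 5.1         | GBK                        | UTF-8         | GET           | UTF-8        | UTF-8                | Correct             | Server default configuration
Apusic 5.1         | GBK                        | UTF-8         | Hyperlink     | GBK          | GBK                  | Correct             | Server default configuration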

As the table above shows, on the WebSphere 6.1 and Apusic 5.1 application servers getParameter decodes both GET and POST parameters with the character set specified by setCharacterEncoding, while Tomcat 5.5 uses the setCharacterEncoding value for POST but not for GET. Looking back at these experiments: with POST, the browser encodes the form into a byte stream using the page's character set and sends it to the server, and the server decodes the received bytes into strings using the character set given to setCharacterEncoding.

That is, for a POST submission, as long as "page character set = setCharacterEncoding character set", the value returned by getParameter is correct. GET and hyperlinks work in a similar way.

When a form is submitted with GET, the values are URI-encoded using the page's character set, whereas a hyperlink can be encoded by the program with a chosen character set.

What the two have in common is that the parameter is URI-encoded. In WebSphere, for GET and hyperlinks, as long as "URI-encoding character set = setCharacterEncoding character set", the getParameter result is correct. For a GET form submission, "URI-encoding character set = page character set". For hyperlinks, as described above, Chrome and Firefox use the page's character set for URI encoding, but IE follows no consistent rule.

When Tomcat 5.5 reads parameters passed by GET or in a hyperlink, getParameter decodes them with ISO-8859-1 by default. For example, if the browser sends a request URI-encoded with UTF-8 and Tomcat 5.5 decodes it with ISO-8859-1, the result is wrong. To get the correct value, Tomcat has to decode with UTF-8. Setting URIEncoding="UTF-8" or useBodyEncodingForURI="true" on the Connector makes Tomcat use UTF-8 in getParameter (useBodyEncodingForURI="true" means that the query string is decoded with the same character set as the request body, that is, the one set by setCharacterEncoding). If neither is configured and the correct value is still needed, the program has to transcode it. Because getParameter decoded with ISO-8859-1, first call getParameter().getBytes("ISO-8859-1") to get back the original byte array, then decode that array with UTF-8: new String(getParameter().getBytes("ISO-8859-1"), "UTF-8")
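The transcoding step works because ISO-8859-1 maps every byte value to exactly one character, so re-encoding with it restores the original bytes. The following sketch simulates the wrong decode locally instead of calling a real getParameter; the class RecoverDemo and the method recover are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class RecoverDemo {
    // Undo a wrong ISO-8859-1 decode: restore the original bytes, then decode them as UTF-8.
    static String recover(String wronglyDecoded) {
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // U+4E2D

        // Simulate what the server would hand back: UTF-8 bytes decoded as ISO-8859-1.
        byte[] sentByBrowser = zhong.getBytes(StandardCharsets.UTF_8);
        String wronglyDecoded = new String(sentByBrowser, StandardCharsets.ISO_8859_1);

        System.out.println(wronglyDecoded.equals(zhong));          // false
        System.out.println(recover(wronglyDecoded).equals(zhong)); // true
    }
}
```

This repair only applies when the wrong decode really was ISO-8859-1; applying it to a value that was already decoded correctly would corrupt it.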

3. Setting the page encoding for the browser

What the server sends to the browser is also encoded into a byte stream for transmission over the network. The browser decodes the received byte stream into strings with some character set and displays them. If the two ends use different character sets, the result is again garbled text.

For example, if static HTML files or JSPs are stored in UTF-8, you need to tell the browser to decode with UTF-8:

    • In a JSP, use <%@ page contentType="text/html; charset=UTF-8" language="java" %>
    • In a static file, use <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    • When writing output directly in a servlet, use response.setCharacterEncoding("UTF-8"), response.setContentType("text/html;charset=UTF-8"), or response.setHeader("Content-Type", "text/html;charset=UTF-8")

These settings are all equivalent to adding "Content-Type: text/html;charset=UTF-8" to the response header.

The encoding given in the header takes priority over the META tag in the HTML. That is, if the servlet calls setContentType("text/html;charset=UTF-8") while the JSP contains <meta http-equiv="Content-Type" content="text/html; charset=GBK"/>, the browser decodes with UTF-8.
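The priority rule can be written down as a small helper. This is a simplified model for illustration only, not how any particular browser is implemented: real browsers consider further signals such as a byte order mark. The class CharsetPrecedenceDemo and both method names are made up.

```java
import java.util.Locale;

class CharsetPrecedenceDemo {
    // Extract the charset parameter from a Content-Type value, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return p.substring("charset=".length()).trim();
            }
        }
        return null;
    }

    // The HTTP header wins; the META value is used only when the header names no charset.
    static String effectiveCharset(String headerContentType, String metaContentType) {
        String fromHeader = charsetOf(headerContentType);
        return fromHeader != null ? fromHeader : charsetOf(metaContentType);
    }

    public static void main(String[] args) {
        System.out.println(effectiveCharset("text/html;charset=UTF-8", "text/html; charset=GBK"));
        // UTF-8
        System.out.println(effectiveCharset("text/html", "text/html; charset=GBK"));
        // GBK
    }
}
```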
