Submitting Chinese character values on the web

Last Update:2014-01-02 Source: Internet

Author: User



Chinese characters can appear in a URL in two places: in the URL path or in a URL parameter.
Whether the first case works depends on the web server and the operating system, so it should be avoided when developing web applications.
In the second case, the parameter must be sent in encoded form, and the receiving side obtains the decoded value when it handles the request.
Parameters can be submitted through a form, typed into the browser's address bar, or sent by clicking a link.
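
As a rough illustration of what "encoded" and "decoded" mean here, the sketch below percent-encodes a parameter value and decodes it again using the JDK's URLEncoder and URLDecoder. The class and method names are made up for this example, and the string "\u4e2d\u6587" is the escaped form of the two Chinese characters 中文 used as the test value in the rest of the article.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class QueryParamCodec {
    // Percent-encode a value using the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    // Reverse of encode: turn %XX sequences back into characters.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    // Build a "name=value" pair in which both parts are encoded.
    static String queryString(String name, String value, String charsetName) {
        return encode(name, charsetName) + "=" + encode(value, charsetName);
    }

    public static void main(String[] args) {
        String value = "\u4e2d\u6587"; // escaped form of the two test characters
        String query = queryString("param", value, "UTF-8");
        System.out.println(query);
        // The receiving side gets the original value back by decoding.
        System.out.println(decode(query.substring("param=".length()), "UTF-8").equals(value));
    }
}
```

The same charset has to be used on both sides; the sections below look at what happens when it is not.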
The following tests were carried out in Tomcat.
First, let's look at submitting a form with POST.

The HTML file starts as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
Set meta to <meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>

The submitted content captured by Fiddler is param=%D6%D0%CE%C4.

System.out.println(java.net.URLEncoder.encode("中文", "gb2312")); prints %D6%D0%CE%C4.

This confirms that the browser encodes the parameter as GB2312.

Use the following code on the back end to obtain the correct value.

request.setCharacterEncoding("GB2312");

String param = request.getParameter("param");

Note that setCharacterEncoding must be called before any data is read from the request.

If no encoding is set with setCharacterEncoding, the request body is decoded as ISO-8859-1 by default, and the value needs to be converted as follows.

String gbparam = new String(request.getParameter("param").getBytes("ISO-8859-1"), "GB2312");
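
The conversion above works because ISO-8859-1 maps every byte to exactly one character, so the original bytes can be recovered from the wrongly decoded string. The following sketch simulates this outside a servlet container. The class and method names are hypothetical, and it assumes the JDK in use ships the GB2312 charset (standard JDK builds normally do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Recovery {
    // Simulates a container that decodes the raw bytes as ISO-8859-1.
    static String misdecode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Gets the original bytes back from the ISO-8859-1 string
    // and decodes them with the charset the browser really used.
    static String recover(String garbled, String realCharsetName) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(realCharsetName));
    }

    public static void main(String[] args) {
        // GB2312 bytes captured in the article: %D6%D0%CE%C4
        byte[] raw = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        String garbled = misdecode(raw);
        String fixed = recover(garbled, "GB2312");
        System.out.println("garbled length: " + garbled.length()); // one char per byte
        System.out.println("recovered length: " + fixed.length()); // one char per two bytes
    }
}
```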


Set meta to <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
The submitted content captured by Fiddler is param=%E4%B8%AD%E6%96%87.
System.out.println(java.net.URLEncoder.encode("中文", "UTF-8")); prints %E4%B8%AD%E6%96%87.
This confirms that the browser encodes the parameter as UTF-8.
Use the following code on the back end to obtain the correct value.
request.setCharacterEncoding("UTF-8");
String param = request.getParameter("param");
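
To see why the two page charsets produce different captures, it can help to percent-encode the raw bytes by hand. The sketch below is only an illustration (the class and method names are invented, and unlike URLEncoder it escapes every byte, including ASCII letters). It assumes the GB2312 charset is available in the JDK; "\u4e2d\u6587" is the escaped form of 中文.

```java
import java.nio.charset.Charset;

class PercentBytes {
    // Convert the value to bytes in the given charset and write each byte as %XX.
    static String percentEncodeAll(String value, String charsetName) {
        byte[] bytes = value.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = "\u4e2d\u6587";
        // Two bytes per character in GB2312, three in UTF-8.
        System.out.println("gb2312: param=" + percentEncodeAll(value, "GB2312"));
        System.out.println("UTF-8:  param=" + percentEncodeAll(value, "UTF-8"));
    }
}
```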

In the address bar, enter http://localhost:8080/prjWebSec/encode/enctest1.jsp?param=中文
The request captured by Fiddler is shown below. Both Firefox and Chrome use UTF-8 encoding. [This may depend on OS and server settings; System.out.println(System.getProperty("file.encoding")) printed utf8 in this environment.]
http://localhost:8080/prjWebSec/encode/enctest1.jsp?param=%E4%B8%AD%E6%96%87

In my test, when the request to the JSP file is made after the POST, the submitted parameters are all UTF-8 encoded,
no matter whether the meta charset is set to gb2312 or UTF-8. [This may depend on OS and server settings; System.out.println(System.getProperty("file.encoding")) printed utf8 in this environment.]
If you encode the parameter in a link as follows:
<a href="encode/enctest2.jsp?param=<%=URLEncoder.encode("中文", "UTF-8")%>">Chinese encoded</a><br>
the source seen by the browser is as follows, with the Chinese characters UTF-8 encoded:
<a href="encode/enctest2.jsp?param=%E4%B8%AD%E6%96%87">Chinese encoded</a><br>
After the link is clicked, the browser's address bar shows http://localhost:8080/prjWebSec/encode/enctest2.jsp?param=中文, because the browser automatically decodes the URL for display.
What is actually sent to the server is still %E4%B8%AD%E6%96%87.
Once the back end has the parameter, processing follows the same principle as for POST. Note that in Tomcat, setCharacterEncoding applies to the request body; the query string is decoded according to the connector's URIEncoding setting unless useBodyEncodingForURI is enabled.

If you encode the parameter twice in the link:
<a href="encode/enctest2.jsp?param=<%=URLEncoder.encode(URLEncoder.encode("中文", "UTF-8"), "UTF-8")%>">Chinese double encoded</a>
the source seen by the browser is as follows. The first pass encodes the Chinese characters as %E4%B8%AD%E6%96%87, and the second pass encodes each % as %25. The final result is:
<a href="encode/enctest2.jsp?param=%25E4%25B8%25AD%25E6%2596%2587">Chinese double encoded</a>
After the link is clicked, the browser's address bar shows the same form as the source: the browser only decodes %xx sequences for display and does not decode the double-encoded %25xx sequences.
http://localhost:8080/prjWebSec/encode/enctest2.jsp?param=%25E4%25B8%25AD%25E6%2596%2587
The back end needs to perform one additional UTF-8 decode after obtaining the parameter. The request encoding does not matter here: request.getParameter("param")
always returns %E4%B8%AD%E6%96%87, because that value contains only ASCII characters. The following code yields the value "中文".
String decodedparam = URLDecoder.decode(request.getParameter("param"), "UTF-8");
As for encoding, when multi-byte characters are involved it is generally best to set both the submitting side and the receiving side to UTF-8.
The receiving side's encoding must be set before any data is read from the request. There is generally no need to encode a parameter twice in a link.
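
The double-encoding case can be reproduced without a browser or server. In the sketch below (hypothetical class and method names), the first decode stands in for the one the servlet container performs inside request.getParameter, and the second decode is the extra step the back end must do itself. "\u4e2d\u6587" is the escaped form of 中文.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class DoubleEncoding {
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String value = "\u4e2d\u6587";
        String once = encode(value);   // characters become %XX
        String twice = encode(once);   // each % becomes %25
        System.out.println(once);
        System.out.println(twice);

        // Stand-in for the container's own decode inside getParameter.
        String fromGetParameter = decode(twice);
        // The extra decode the back end has to perform.
        String finalValue = decode(fromGetParameter);
        System.out.println(fromGetParameter.equals(once));
        System.out.println(finalValue.equals(value));
    }
}
```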

The overall flow of submitting and reading a parameter is as follows: the browser encodes the parameter,
and the encoded parameter is sent to the web server. When it arrives, the server decodes the content.
Taking Tomcat as an example, without further configuration the UTF-8 bytes are decoded as ISO-8859-1, so the parameter value is represented as ISO-8859-1 characters.
If request.setCharacterEncoding("UTF-8") is called first,
request.getParameter("param") decodes the bytes as UTF-8 instead.

If the request body is read directly as a byte stream, the content is as follows:
112 97 114 97 109 61 37 69 52 37 66 56 37 65 68 37 69 54 37 57 54 37 56 55
This corresponds to param=%E4%B8%AD%E6%96%87, so the decoding is performed inside request.getParameter.
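
The byte values listed above can be checked with a short sketch. It treats the captured numbers as ASCII bytes to get the body text, then applies the percent-decoding that getParameter would otherwise perform. The class and method names are illustrative only; "\u4e2d\u6587" is the escaped form of 中文.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class BodyBytes {
    // The byte values captured from the request input stream in the article.
    static final int[] CAPTURED = {
        112, 97, 114, 97, 109, 61, 37, 69, 52, 37, 66, 56,
        37, 65, 68, 37, 69, 54, 37, 57, 54, 37, 56, 55
    };

    // The raw body is plain ASCII, so each value maps to one character.
    static String asText(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Decode the part after '=' with the given charset.
    static String decodeValue(String body, String charsetName) {
        String encoded = body.substring(body.indexOf('=') + 1);
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String body = asText(CAPTURED);
        System.out.println(body);
        System.out.println(decodeValue(body, "UTF-8").length()); // number of decoded characters
    }
}
```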
If an encoding was set with setCharacterEncoding beforehand, that encoding is used for decoding. Otherwise the servlet container's
default encoding is used. The code for reading the byte stream is as follows:

ServletInputStream input = request.getInputStream();
System.out.println(input);
byte[] readBytes = new byte[10];
while (true) {
    if (input.available() >= 10) {
        // At least 10 bytes are available: read a block and print each value.
        input.read(readBytes);
        for (byte b : readBytes) {
            System.out.print(b + " ");
        }
    } else {
        // Fewer than 10 bytes left: read the rest one byte at a time.
        int readByte = 0;
        while ((readByte = input.read()) != -1) {
            System.out.print(readByte + " ");
        }
        break;
    }
}

Generally, you can configure a filter and set the request encoding in the filter as follows:

<filter>
    <filter-name>encodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF8</param-value>
    </init-param>
    <init-param>
        <param-name>forceEncoding</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>encodingFilter</filter-name>
    <url-pattern>*.html</url-pattern>
</filter-mapping>
<filter-mapping>
    <filter-name>encodingFilter</filter-name>
    <url-pattern>*.jsp</url-pattern>
</filter-mapping>

Supplement on ISO-8859-1

ISO-8859-1 is a single-byte encoding: it reads the byte stream one byte at a time and converts each byte into the corresponding ISO-8859-1 character.

Its code range is 0-255 (0x00-0xFF), which is the full range of values a single byte can take.

Because it works byte by byte, the content of the byte stream is never changed.

If you read a file in a multi-byte encoding as ISO-8859-1, turning it into a string of ISO-8859-1 characters,

and then write that string to another file in ISO-8859-1, the binary content of the saved file does not change.

ASCII is also a single-byte encoding, but its code range is only 0-127, so it cannot represent the remaining single-byte values. A byte greater than 127

is converted to the ASCII '?' character (0x3F), and the byte content becomes 0x3F.
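
The difference between the two single-byte encodings can be shown with a decode-then-encode round trip. This is a minimal sketch with an invented class name; it relies on the documented behavior of Java's String constructor and getBytes, which substitute a replacement for anything the charset cannot represent. "\u4e2d\u6587" is the escaped form of 中文.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SingleByteRoundTrip {
    // Decode the bytes with the charset, then encode the resulting string again.
    static byte[] roundTrip(byte[] original, Charset charset) {
        return new String(original, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // Six UTF-8 bytes, all greater than 127.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println("original:   " + Arrays.toString(utf8));
        // ISO-8859-1 covers 0-255, so the bytes survive unchanged.
        System.out.println("ISO-8859-1: " + Arrays.toString(roundTrip(utf8, StandardCharsets.ISO_8859_1)));
        // US-ASCII covers only 0-127, so each byte ends up as '?' (0x3F).
        System.out.println("US-ASCII:   " + Arrays.toString(roundTrip(utf8, StandardCharsets.US_ASCII)));
    }
}
```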

Confusing one multi-byte encoding with another, on the other hand, does corrupt the data. For example, each Chinese character is two bytes in GB2312 and typically three bytes in UTF-8.

If you read a UTF-8 encoded file as GB2312, taking the UTF-8 bytes of "中文" (%E4%B8%AD%E6%96%87) as an example,

GB2312 treats %E4%B8 as one character, which may be some other character in GB2312; in that case the byte content is not changed. It then treats %AD%E6 as one character, for which there may be no corresponding character; if there is none, it becomes the '?' character and the byte content changes as well. The same applies to %96%87.

Read as GB2312, %E4%B8%AD%E6%96%87 becomes "涓??". If that is saved, the bytes become %E4%B8%3F%3F.
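
A rough way to observe this corruption is to decode the UTF-8 bytes as GB2312 and encode the result again. The sketch below uses hypothetical names and assumes the JDK provides the GB2312 charset. The exact number of replacement characters depends on the decoder, so the saved bytes may not match the article's %E4%B8%3F%3F exactly; what matters is that the first two bytes survive and the rest do not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongMultiByteRead {
    // Format bytes as space-separated hex pairs for printing.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Decode with the (wrong) charset and encode again, as saving the text would.
    static byte[] readAsAndSave(byte[] original, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        String text = new String(original, charset);
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] saved = readAsAndSave(utf8, "GB2312");
        System.out.println("before: " + hex(utf8));
        System.out.println("after:  " + hex(saved));
    }
}
```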

The following code does not "destroy" the content of the file.

public static void ioiso88591file(String fileName, String outfileName) throws Exception {
    // Requires java.io.*. Both streams use ISO-8859-1, so each byte maps to one char and back.
    try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName), "iso-8859-1"));
         BufferedWriter bwr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outfileName), "iso-8859-1"))) {
        // Copy character by character; readLine() would drop the line terminators and change the bytes.
        int c;
        while ((c = br.read()) != -1) {
            bwr.write(c);
        }
    }
}

If you call it as follows, the binary content of utf8.txt and utf8iso88591.txt is the same,

and the binary content of gb2312.txt and gb2312iso88591.txt is the same.

ioiso88591file("C:/D/charset/utf8.txt", "C:/D/charset/utf8iso88591.txt");

ioiso88591file("C:/D/charset/gb2312.txt", "C:/D/charset/gb2312iso88591.txt");

