Similarly, this file is written in UltraEdit, and the word "Chinese" (中文) is again saved as the GB2312 bytes "D6 D0 CE C4". First, let us simulate how the generated Java file and class file are produced: jspc interprets "Chinese" as ISO-8859-1 and maps it to Unicode. Because ISO-8859-1 is an 8-bit Latin encoding, its mapping rule is simply to prepend "00" to each byte, so the mapped Unicode should be "\u00D6\u00D0\u00CE\u00C4", which after conversion to UTF should be "C3 96 C3 90 C3 8E C3 84". Now open the Java file and the class file and check: "Chinese" is indeed stored as "C3 96 C3 90 C3 8E C3 84" in both.
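The mapping just described can be reproduced outside jspc. The following is a minimal sketch, not jspc's actual code: the class name `Latin1MappingDemo` and its helpers are made up for illustration, and standard UTF-8 is used as a stand-in for the class-file format (class files use a modified UTF-8, which gives the same bytes for these characters).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes the GB2312 bytes as ISO-8859-1 and re-encodes as UTF-8.
class Latin1MappingDemo {
    // The bytes saved by the editor for the two Chinese characters (from the text).
    static final byte[] SOURCE = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Format bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Format the UTF-16 code units of a string as four-digit hex values.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps each byte to the code point with the same value ("00" prepended).
        String decoded = new String(SOURCE, StandardCharsets.ISO_8859_1);
        System.out.println(codeUnits(decoded));                            // 00D6 00D0 00CE 00C4
        System.out.println(hex(decoded.getBytes(StandardCharsets.UTF_8))); // C3 96 C3 90 C3 8E C3 84
    }
}
```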
If <JSP-charset> is not specified in the code above, that is, if the first line is written as "<%@ page contentType="text/html" %>", jspc uses the file.encoding setting to interpret the JSP file. On RedHat 6.2, the result is exactly the same as when ISO-8859-1 is specified.
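To see which default applies on a given machine, the file.encoding property can be printed directly. This is a small sketch with a made-up class name (`DefaultEncodingDemo`); the value it reports depends on the JVM version and platform, so no particular output is assumed here.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: reports the JVM's default encoding settings.
class DefaultEncodingDemo {
    // Build a one-line description of the default encoding.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```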
So far I have explained how Chinese characters are mapped during the transformation from JSP file to class file. In one sentence: from <JSP-charset> to Unicode to UTF. The following table summarizes the process:
Table 2: conversion of "Chinese" from JSP to class
JSP-charset | In the JSP file | In the Java file | In the class file
GB2312 | D6 D0 CE C4 (GB2312) | \u4E2D\u6587 (Unicode) to E4 B8 AD E6 96 87 (UTF) | E4 B8 AD E6 96 87 (UTF)
ISO-8859-1 | D6 D0 CE C4 (GB2312) | \u00D6\u00D0\u00CE\u00C4 (Unicode) to C3 96 C3 90 C3 8E C3 84 (UTF) | C3 96 C3 90 C3 8E C3 84 (UTF)
None (default = file.encoding) | Same as ISO-8859-1 | Same as ISO-8859-1 | Same as ISO-8859-1
In the next section, we first discuss how a servlet is converted from Java file to class file, and then explain how the class file outputs to the client. The reason for this arrangement is that JSP and servlets handle output in the same way.
Servlet: the process from source file to class file
The servlet source file is a ".java" file. This section describes the servlet compilation process and tracks how the Chinese characters change.
Use "javac" to compile the servlet source file. javac accepts the parameter "-encoding <compile-charset>", which means "interpret the servlet source file using the encoding specified in <compile-charset>".
When the source file is compiled, <compile-charset> is used to interpret all characters, both Chinese and ASCII. The compiler then converts character constants to Unicode, and finally converts the Unicode to UTF.
In a servlet, you can also set the charset of the output stream. Calling the setContentType method before writing any output achieves the same effect as setting <JSP-charset> in JSP; this charset is called <servlet-charset> here.
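Note: three variables are mentioned in this article: <JSP-charset>, <compile-charset>, and <servlet-charset>. Of these, JSP files are related only to <JSP-charset>, while <compile-charset> and <servlet-charset> are related only to servlets.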
See the following example:
import javax.servlet.*;
import javax.servlet.http.*;

public class TestServlet extends HttpServlet
{
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, java.io.IOException
    {
        // <servlet-charset>: the charset used to encode the response body
        resp.setContentType("text/html; charset=GB2312");
        java.io.PrintWriter out = resp.getWriter();
        out.println("<html>");
        // "Chinese", stored in the source file as D6 D0 CE C4 (GB2312)
        out.println("#中文#");
        out.println("</html>");
    }
}
This file is also written in UltraEdit for Windows, where the word "Chinese" is saved as "D6 D0 CE C4" (GB2312 encoding).
Now compile. The following table shows the hexadecimal code of the Chinese characters in the class file for different values of <compile-charset>. <servlet-charset> plays no role during compilation; it affects only the output of the class file. In effect, <servlet-charset> and <compile-charset> together achieve the same result as <JSP-charset> in a JSP file, because <JSP-charset> affects both compilation and the output of the class file.
Table 3: conversion of "Chinese" from servlet source file to class
Compile-charset | In the servlet source file | In the class file | Equivalent Unicode code
GB2312 | D6 D0 CE C4 (GB2312) | E4 B8 AD E6 96 87 (UTF) | \u4E2D\u6587 (= "Chinese" in Unicode)
ISO-8859-1 | D6 D0 CE C4 (GB2312) | C3 96 C3 90 C3 8E C3 84 (UTF) | \u00D6\u00D0\u00CE\u00C4 (one 00 prepended to each of D6 D0 CE C4)
None (default) | D6 D0 CE C4 (GB2312) | Same as ISO-8859-1 | Same as ISO-8859-1
The compilation process of an ordinary Java program is exactly the same as that of a servlet.
Is the representation of Chinese in the class file clear now? Good. Next, let's look at how a class outputs Chinese characters.
Class: outputting strings
As mentioned above, strings are held in memory as Unicode. What a given Unicode code represents depends on the character set it was mapped from, that is, on its ancestry. It is like checked luggage: from the outside it looks like a cardboard box, and what is inside depends on what the sender actually packed.
Take the example above. Given the Unicode string "00D6 00D0 00CE 00C4": if it is not converted and is looked up directly in the Unicode code table, it is four characters (special characters at that). If it is mapped to ISO8859-1, simply dropping the leading "00" gives "D6 D0 CE C4", which is four characters in the ISO8859-1 code table. If it is mapped to GB2312, the result is likely to be garbage, because GB2312 may or may not contain characters corresponding to 00D6 and the like. Where there is no match you get 0x3F, the question mark; where there is a match, code points as low as 00D6 are probably special symbols, since real Chinese characters in Unicode start at 4E00.
As you can see, the same Unicode string can be interpreted in different ways, and only one of them is the result we expect. In the example above, "D6 D0 CE C4" is what we want: when "D6 D0 CE C4" is output to IE and viewed as Simplified Chinese, the word "Chinese" (中文) appears. (Of course, if you insist on viewing it as "Western European", there is nothing to be done; you will not get any Chinese characters.) Why does this work? Because "00D6 00D0 00CE 00C4" was originally converted from ISO8859-1.
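The different interpretations of the same Unicode string can be checked with a short program. This is a sketch under the assumption that the JDK in use ships the GB2312 charset; the class name `InterpretationDemo` and its helpers are illustrative only.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: encodes one Unicode string with two different charsets.
class InterpretationDemo {
    // The in-memory string from the example: code units 00D6 00D0 00CE 00C4.
    static final String UNICODE = "\u00D6\u00D0\u00CE\u00C4";

    // Format bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode a string with the named charset; unmappable characters are replaced.
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Mapped to ISO-8859-1: the leading 00 is dropped from each code unit.
        byte[] latin1 = encode(UNICODE, "ISO-8859-1");
        System.out.println(hex(latin1)); // D6 D0 CE C4

        // Those bytes, read as GB2312, are the two Chinese characters.
        String viewed = new String(latin1, Charset.forName("GB2312"));
        System.out.println(viewed.equals("\u4E2D\u6587")); // true

        // Mapped to GB2312 instead: the original bytes are not recovered.
        System.out.println(hex(encode(UNICODE, "GB2312")));
    }
}
```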
The conclusions are as follows:
Before a class outputs a string, the Unicode string is re-encoded into a byte stream according to some encoding, and that byte stream is then output. This is equivalent to one extra "String.getBytes(???)" step, where ??? stands for a character set.
For a servlet, this encoding is the one specified in the HttpServletResponse.setContentType() method, that is, the <servlet-charset> defined above.
For JSP, this encoding is the one specified in <%@ page contentType="" %>, that is, the <JSP-charset> defined above.
For an ordinary Java program, this encoding is the one specified by file.encoding, which defaults to ISO8859-1 in the environment described here.
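The claim that output amounts to a getBytes step can be illustrated with a plain writer. This is a minimal sketch rather than what a servlet container actually does internally: `OutputStepDemo` is a made-up class, an in-memory stream stands in for the response stream, and GB2312 is assumed to be available in the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class: writing through a charset-aware writer vs. calling getBytes.
class OutputStepDemo {
    // Write text through a writer configured with the given charset and return the bytes.
    static byte[] write(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, Charset.forName(charsetName))) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory sink
        }
        return sink.toByteArray();
    }

    // Format bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587"; // the two Chinese characters in Unicode
        byte[] written = write(text, "GB2312");
        byte[] viaGetBytes = text.getBytes(Charset.forName("GB2312"));
        System.out.println(hex(written));                        // D6 D0 CE C4
        System.out.println(Arrays.equals(written, viaGetBytes)); // true
    }
}
```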
When the output object is a browser
Take the popular browser IE as an example. IE supports multiple encodings. If IE receives the byte stream "D6 D0 CE C4", you can try viewing it with various encodings, and you will find that "Simplified Chinese" gives the correct result, because "D6 D0 CE C4" is the encoding of "Chinese" in Simplified Chinese in the first place.
OK, let's go through the whole process from start to finish.
JSP: the source file is a text file in GB2312 format, and it contains the Chinese characters "Chinese" (中文).
If <JSP-charset> is set to GB2312, the conversion process is as follows.
Table 4: conversion process when JSP-charset = GB2312
Step | Description | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file to a temporary Java file, maps the string to Unicode according to GB2312, and writes it to the Java file in UTF format | E4 B8 AD E6 96 87
3 | Compile the temporary Java file into a class file | E4 B8 AD E6 96 87
4 | Read the string from the class file with readUTF; in memory it is Unicode | 4E 2D 65 87 (in Unicode, 4E2D = 中, 6587 = 文)
5 | Convert the Unicode to a byte stream according to JSP-charset = GB2312 | D6 D0 CE C4
6 | Output the byte stream to IE and set IE's encoding to GB2312 (author's note: this information is carried in the HTTP header) | D6 D0 CE C4
7 | IE displays the result as "Simplified Chinese" | "Chinese" (displayed correctly)
If you specify <JSP-charset> as ISO8859-1, the conversion process is as follows.
Table 5: conversion process when JSP-charset = ISO8859-1
Step | Description | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file to a temporary Java file, maps the string to Unicode according to ISO8859-1, and writes it to the Java file in UTF format | C3 96 C3 90 C3 8E C3 84
3 | Compile the temporary Java file into a class file | C3 96 C3 90 C3 8E C3 84
4 | Read the string from the class file with readUTF; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4 (means nothing!)
5 | Convert the Unicode to a byte stream according to JSP-charset = ISO8859-1 | D6 D0 CE C4
6 | Output the byte stream to IE and set IE's encoding to ISO8859-1 (author's note: this information is carried in the HTTP header) | D6 D0 CE C4
7 | IE displays the result as "Western European" | Garbled. These are really four single-byte characters, but they display strangely because their values are greater than 128
8 | Change IE's page encoding to "Simplified Chinese" | "Chinese" (displayed correctly)
Strange! Why does the page display correctly whether <JSP-charset> is set to GB2312 or to ISO8859-1? Because steps 2 and 5 in Table 4 and Table 5 are inverses of each other and cancel out. The only difference is that with ISO8859-1 you have to add step 8, which is inconvenient.
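The cancellation of steps 2 and 5 can be simulated directly. The sketch below assumes the JDK provides GB2312; `RoundTripDemo` and its method names are invented for illustration and do not correspond to anything in jspc.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: decode (step 2) and encode (step 5) with the same charset.
class RoundTripDemo {
    static final byte[] SOURCE = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Format bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Format the UTF-16 code units of a string as four-digit hex values.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // Step 2: map the source bytes to Unicode using the JSP charset.
    static String inMemory(byte[] source, String jspCharset) {
        return new String(source, Charset.forName(jspCharset));
    }

    // Step 5: map the Unicode string back to bytes using the same charset.
    static byte[] output(String text, String jspCharset) {
        return text.getBytes(Charset.forName(jspCharset));
    }

    public static void main(String[] args) {
        for (String cs : new String[] {"GB2312", "ISO-8859-1"}) {
            String s = inMemory(SOURCE, cs);
            System.out.println(cs + ": memory " + codeUnits(s) + " -> output " + hex(output(s, cs)));
        }
        // GB2312: memory 4E2D 6587 -> output D6 D0 CE C4
        // ISO-8859-1: memory 00D6 00D0 00CE 00C4 -> output D6 D0 CE C4
    }
}
```

The in-memory strings differ, but the bytes sent to the browser are identical, which is why both settings can end up displaying correctly.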
Now look at the case where <JSP-charset> is not specified.
Table 6: conversion process when JSP-charset is not specified
Step | Description | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file to a temporary Java file, maps the string to Unicode according to ISO8859-1, and writes it to the Java file in UTF format | C3 96 C3 90 C3 8E C3 84
3 | Compile the temporary Java file into a class file | C3 96 C3 90 C3 8E C3 84
4 | Read the string from the class file with readUTF; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4
5 | Convert the Unicode to a byte stream according to JSP-charset = ISO8859-1 | D6 D0 CE C4
6 | Output the byte stream to IE | D6 D0 CE C4
7 | IE views the result using the page encoding that was in effect when the request was sent | Depends on the situation. If the encoding is Simplified Chinese, the page displays correctly; otherwise you need to perform step 8 of Table 5
Servlet: the source file is a Java file in GB2312 format, and it contains the Chinese characters "Chinese" (中文).
If <compile-charset> = GB2312 and <servlet-charset> = GB2312:
Table 7: conversion process when compile-charset = servlet-charset = GB2312
Step | Description | Result
1 | Write the servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | Use javac -encoding GB2312 to compile the source file into a class file | E4 B8 AD E6 96 87 (UTF)
3 | Read the string from the class file with readUTF; in memory it is Unicode | 4E 2D 65 87 (Unicode)
4 | Convert the Unicode to a byte stream according to servlet-charset = GB2312 | D6 D0 CE C4 (GB2312)
5 | Output the byte stream to IE and set IE's encoding attribute to servlet-charset = GB2312 | D6 D0 CE C4 (GB2312)
6 | IE displays the result as "Simplified Chinese" | "Chinese" (displayed correctly)
If <compile-charset> = ISO8859-1 and <servlet-charset> = ISO8859-1:
Table 8: conversion process when compile-charset = servlet-charset = ISO8859-1
Step | Description | Result
1 | Write the servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | Use javac -encoding ISO8859-1 to compile the source file into a class file | C3 96 C3 90 C3 8E C3 84 (UTF)
3 | Read the string from the class file with readUTF; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4
4 | Convert the Unicode to a byte stream according to servlet-charset = ISO8859-1 | D6 D0 CE C4
5 | Output the byte stream to IE and set IE's encoding attribute to servlet-charset = ISO8859-1 | D6 D0 CE C4 (GB2312)
6 | IE displays the result as "Western European" | Garbled (for the same reason as in Table 5)
7 | Change IE's page encoding to "Simplified Chinese" | "Chinese" (displayed correctly)
If you do not specify compile-charset or servlet-charset, the default value is ISO8859-1.
When compile-charset = servlet-charset, steps 2 and 4 are inverses of each other and cancel out, so the result is correct. You can try setting compile-charset and servlet-charset to different values: the result will be incorrect.
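The mismatched case is easy to try without a servlet container. The following sketch models compilation as a decode and output as an encode; `CharsetMismatchDemo` and `pipeline` are made-up names, and GB2312 is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: decode with compile-charset, encode with servlet-charset.
class CharsetMismatchDemo {
    static final byte[] SOURCE = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Format bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Source bytes -> Unicode (compile step) -> output bytes (output step).
    static byte[] pipeline(byte[] source, String compileCharset, String servletCharset) {
        String inMemory = new String(source, Charset.forName(compileCharset));
        return inMemory.getBytes(Charset.forName(servletCharset));
    }

    public static void main(String[] args) {
        // Matching charsets: the original bytes come back.
        System.out.println(hex(pipeline(SOURCE, "GB2312", "GB2312")));         // D6 D0 CE C4
        System.out.println(hex(pipeline(SOURCE, "ISO-8859-1", "ISO-8859-1"))); // D6 D0 CE C4
        // Mismatched charsets: the Chinese characters cannot be encoded and become '?'.
        System.out.println(hex(pipeline(SOURCE, "GB2312", "ISO-8859-1")));     // 3F 3F
        // The reverse mismatch also fails to reproduce the original bytes.
        System.out.println(hex(pipeline(SOURCE, "ISO-8859-1", "GB2312")));
    }
}
```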
When the output object is a database
When the output goes to a database, the principle is the same as when it goes to a browser. This section uses a servlet as the example; you can work out the JSP case yourself.
Suppose a servlet receives a Chinese string from the client (IE, Simplified Chinese), writes it into a database whose encoding is ISO8859-1, then reads the string back from the database and displays it to the client.
Table 9: conversion process when the output object is a database (1)
Step | Description | Result | Where
1 | Enter "Chinese" in IE | D6 D0 CE C4 | IE
2 | IE converts the string to UTF and sends it into the transport stream | E4 B8 AD E6 96 87 | IE
3 | The servlet receives the input stream and reads it with readUTF | 4E 2D 65 87 (Unicode) | Servlet
4 | The programmer must restore the string to a byte stream according to GB2312 | D6 D0 CE C4 | Servlet
5 | The programmer generates a new string according to the database encoding ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 | Servlet
6 | Submit the new string to JDBC | 00 D6 00 D0 00 CE 00 C4 | Servlet
7 | JDBC detects that the database encoding is ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 | JDBC
8 | JDBC generates a byte stream from the received string according to ISO8859-1 | D6 D0 CE C4 | JDBC
9 | JDBC writes the byte stream to the database | D6 D0 CE C4 | JDBC
10 | Data storage is complete | D6 D0 CE C4 | Database
The following steps retrieve the data from the database:
11 | JDBC reads the byte stream from the database | D6 D0 CE C4 | JDBC
12 | JDBC generates a string according to the database character set ISO8859-1 and submits it to the servlet | 00 D6 00 D0 00 CE 00 C4 (Unicode) | JDBC
13 | The servlet gets the string | 00 D6 00 D0 00 CE 00 C4 (Unicode) | Servlet
14 | The programmer must restore the original byte stream according to the database encoding ISO8859-1 | D6 D0 CE C4 | Servlet
15 | The programmer must generate a new string according to the client character set GB2312 | 4E 2D 65 87 (Unicode) | Servlet
The servlet is now ready to output the string to the client:
16 | The servlet generates a byte stream according to <servlet-charset> | D6 D0 CE C4 | Servlet
17 | The servlet outputs the byte stream to IE; if <servlet-charset> is specified, it also sets IE's encoding to <servlet-charset> | D6 D0 CE C4 | Servlet
18 | IE views the result according to the specified or default encoding | "Chinese" (displayed correctly) | IE
A note: steps 4 and 5 and steps 14 and 15 in the table (marked in red in the original) are conversions that the programmer must perform. Steps 4 and 5 are really one statement: "new String(source.getBytes("GB2312"), "ISO8859-1")". Steps 14 and 15 are also one statement: "new String(source.getBytes("ISO8859-1"), "GB2312")". Dear reader, did you realize every detail involved when writing code like this?
When the client encoding and the database encoding take other values, or when the output target is the system console, please work through the process yourself. Once you understand the principles above, I believe you can easily write it out.
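The two programmer-side conversions from Table 9 can be tried in isolation. This is a sketch only: `DbRecodeDemo`, `toDatabase`, and `fromDatabase` are hypothetical names, no real database or JDBC driver is involved, and the `Charset` overloads are used instead of the string-name overloads so that no checked exception has to be handled. GB2312 is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: re-expressing a string for an ISO8859-1 database and back.
class DbRecodeDemo {
    static final Charset GB2312 = Charset.forName("GB2312");

    // Format the UTF-16 code units of a string as four-digit hex values.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // Steps 4 and 5: GB2312 bytes re-read as ISO8859-1, ready to hand to the driver.
    static String toDatabase(String source) {
        return new String(source.getBytes(GB2312), StandardCharsets.ISO_8859_1);
    }

    // Steps 14 and 15: ISO8859-1 bytes re-read as GB2312, ready to send to the client.
    static String fromDatabase(String stored) {
        return new String(stored.getBytes(StandardCharsets.ISO_8859_1), GB2312);
    }

    public static void main(String[] args) {
        String typed = "\u4E2D\u6587";             // what the servlet holds after step 3
        String forDb = toDatabase(typed);
        System.out.println(codeUnits(forDb));      // 00D6 00D0 00CE 00C4
        String back = fromDatabase(forDb);
        System.out.println(back.equals(typed));    // true
    }
}
```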
At this point the article can come to a close. The end has come back to the beginning: for programmers, almost nothing changes, because we were told long ago to do it this way.
The conclusions follow.
1. In a JSP file, you must specify contentType, and its charset value must match the character set used by the client browser. String constants need no encoding conversion. String variables must be restorable to a byte stream the client can recognize according to the character set specified in contentType; simply put, "string variables are based on the <JSP-charset> character set".
2. In a servlet, you must call HttpServletResponse.setContentType() to set the charset, and it must match the client encoding. For string constants, you must specify -encoding when compiling with javac, and this encoding must match the character set of the platform on which the source file was written, generally GB2312 or GBK. String variables, as in JSP, must be "based on the <servlet-charset> character set".
From: http://lei-1021.javaeye.com/blog/218600