An In-Depth Analysis of How JSP and Servlets Handle Chinese

Source: Internet
Author: User

Every region of the world has its own languages, and these regional differences lead directly to different language environments. When developing an internationalized program, handling language issues correctly is essential.

This is a worldwide problem, so Java provides a worldwide solution. The approach described in this article deals with Chinese, but by extension it applies equally to the languages of other countries and regions.

Chinese characters are double-byte. "Double-byte" means that each character occupies two bytes (16 bits), called the high byte and the low byte. China's mandatory encoding standard is GB2312, and almost every application that can handle Chinese currently supports it. GB2312 contains the level-1 and level-2 Chinese characters plus 9 zones of symbols. The high byte ranges from 0xA1 to 0xFE and the low byte from 0xA1 to 0xFE; Chinese characters occupy the range 0xB0A1 to 0xF7FE.

There is also an encoding called GBK, but it is a specification rather than a mandatory standard. GBK defines 20,902 Chinese characters, is compatible with GB2312, and has an encoding range of 0x8140 to 0xFEFE. Every character in GBK can be mapped to Unicode 2.0.

In the near future, China will issue another standard: GB18030-2000 (GBK2K). It includes the scripts of Tibetan, Mongolian, and other minority languages, which fundamentally solves the problem of missing characters. Note that it is no longer fixed-length: the two-byte part is compatible with GBK, and the four-byte part holds the extended characters. In the four-byte part, the first and third bytes range from 0x81 to 0xFE, and the second and fourth bytes range from 0x30 to 0x39.

This article does not intend to introduce Unicode; interested readers can visit "http://www.unicode.org/" for more information. Unicode has one key property: it covers the characters of all the world's languages. The language of every region can therefore be mapped to Unicode, and Java uses exactly this to convert between different languages.

In the JDK, the encodings related to Chinese are:

Table 1 Chinese-related encodings in the JDK

Encoding name | Description
ASCII | 7-bit; same as ascii7
ISO8859-1 | 8-bit; same as 8859_1, ISO-8859-1, ISO_8859-1, latin1, and so on
GB2312-80 | 16-bit; same as gb2312, GB2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB, and so on
GBK | Same as MS936; note that the name is case sensitive
UTF8 | Same as UTF-8
GB18030 | Same as Cp1392, 1392; few JDKs currently support it

In actual programming, the encodings you encounter most often are GB2312 (GBK) and ISO8859-1.

Why "?" appears

As mentioned above, conversion between different languages goes through Unicode. Suppose there are two different languages, A and B. The conversion steps are: first convert A to Unicode, then convert Unicode to B.
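The two-step conversion can be sketched in Java roughly as follows. This is a minimal illustration, not code from the original article: the class name `TranscodeSketch` and the helpers `transcode` and `hex` are hypothetical, and the example uses ISO-8859-1 and UTF-8 only because every JDK is guaranteed to ship them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {
    // Hypothetical helper: charset A -> Unicode (a Java String) -> charset B.
    public static byte[] transcode(byte[] input, Charset from, Charset to) {
        String unicode = new String(input, from); // step 1: A -> Unicode
        return unicode.getBytes(to);              // step 2: Unicode -> B
    }

    // Formats bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 }; // the letter e-acute in ISO-8859-1
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(hex(utf8));
    }
}
```

Running `main` should print the two-byte UTF-8 form of U+00E9. The in-memory `String` plays the role of the Unicode intermediate described above.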

Here is an example. GB2312 contains the Chinese character "李" (Li), encoded as "C0EE", and we want to convert it to ISO8859-1. The steps are: first convert "李" to Unicode, giving "674E", then convert "674E" to an ISO8859-1 character. Of course, this mapping does not succeed, because ISO8859-1 has no character corresponding to "674E".

When a mapping fails, the problem appears. When converting from a language to Unicode, if the language has no such character, the result is the Unicode code "\uFFFD" ("\u" denotes a Unicode code). When converting from Unicode to a language, if the language has no corresponding character, the result is "0x3F" ("?"). This is where the "?" comes from.
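Both substitutions can be observed with a short Java sketch. The class and method names here are hypothetical, and the `main` method assumes that the running JDK provides the GB2312 charset (full JDK builds usually do, but it is not one of the guaranteed standard charsets).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkOrigin {
    // Unicode -> ISO8859-1: characters without a mapping become 0x3F ('?').
    public static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Bytes -> Unicode via US-ASCII: bytes without a mapping become U+FFFD.
    public static String fromAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Assumption: this JDK provides the GB2312 charset.
        Charset gb2312 = Charset.forName("GB2312");
        String li = new String(new byte[] { (byte) 0xC0, (byte) 0xEE }, gb2312);
        System.out.printf("Unicode code: %04X%n", (int) li.charAt(0));
        System.out.printf("ISO8859-1 byte: %02X%n", toLatin1(li)[0] & 0xFF);
        System.out.printf("0x80 decoded as US-ASCII: %04X%n",
                (int) fromAscii(new byte[] { (byte) 0x80 }).charAt(0));
    }
}
```

With the GB2312 charset available, the three printed values should be 674E, 3F, and FFFD, matching the description above.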

For example, applying new String(buf, "gb2312") to the byte stream buf = "0x80 0x40 0xB0 0xA1" gives "\ufffd\u554a", and printing it with println gives "?啊", because "0x80 0x40" is a character in GBK but not in GB2312.

As another example, applying string.getBytes("GBK") to the string string = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9" gives "3fa8aca8a6463fa8b4". Here "\u00d6" has no corresponding character in GBK and yields "3f"; "\u00ec" corresponds to "a8ac"; "\u00e9" corresponds to "a8a6"; "\u0046" corresponds to "46" (because it is an ASCII character); "\u00bb" is not found and yields "3f"; finally, "\u00f9" corresponds to "a8b4". Printing this string with println gives "?ìéF?ù". See? Not everything turned into a question mark, because the mapping between GBK and Unicode contains characters other than Chinese characters, and this example is the proof.

So when Chinese text is transcoded incorrectly, the garbled output is not necessarily all question marks. But wrong is still wrong; fifty paces or a hundred paces, there is no qualitative difference.

One might ask: what happens if a character exists in the source character set but not in Unicode? The answer is that I don't know, because I have no such source character set at hand to test with. One thing is certain, though: such a source character set does not conform to the standard. In Java, if this happens, an exception is thrown.

What is UTF

UTF stands for Unicode Transformation Format, that is, Unicode in a formatted form. It is defined as follows:

(1) If the first 9 bits of the 16-bit Unicode character are 0, it is represented in one byte. The first bit of that byte is "0" and the remaining 7 bits are the same as the last 7 bits of the original character. For example, "\u0034" (0000 0000 0011 0100) is represented as "34" (0011 0100), the same as the low byte of the source Unicode character;

(2) If the first 5 bits of the 16-bit Unicode character are 0, it is represented in 2 bytes. The first byte begins with "110", followed by the 5 highest bits that remain once the leading 5 zero bits of the source character are dropped; the second byte begins with "10", followed by the lower 6 bits of the source character. For example, "\u025d" (0000 0010 0101 1101) is converted to "c99d" (1100 1001 1001 1101);

(3) If neither of the two rules above applies, it is represented in three bytes. The first byte begins with "1110", followed by the high four bits of the source character; the second byte begins with "10", followed by the middle six bits of the source character; the third byte begins with "10", followed by the lower six bits of the source character. For example, "\u9da7" (1001 1101 1010 0111) is converted to "e9b6a7" (1110 1001 1011 0110 1010 0111).
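The three rules can be written out directly as bit operations. The following is a minimal sketch with hypothetical names (`UtfRulesSketch`, `encode`); it covers only single 16-bit characters as described above and ignores special cases such as surrogate pairs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UtfRulesSketch {
    // Encodes one 16-bit character following rules (1) to (3) above.
    public static byte[] encode(char c) {
        if (c < 0x80) {
            // Rule 1: the top 9 bits are zero, one byte 0xxxxxxx.
            return new byte[] { (byte) c };
        } else if (c < 0x800) {
            // Rule 2: the top 5 bits are zero, two bytes 110xxxxx 10xxxxxx.
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // Rule 3: three bytes 1110xxxx 10xxxxxx 10xxxxxx.
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        char[] samples = { 0x0034, 0x025D, 0x9DA7 };
        for (char c : samples) {
            byte[] manual = encode(c);
            byte[] library = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
            // Compare the hand-written rules with the JDK's own UTF-8 encoder.
            System.out.printf("%04X matches JDK: %b%n", (int) c, Arrays.equals(manual, library));
        }
    }
}
```

For the three sample characters from the rules above, the hand-written encoder and the JDK's UTF-8 encoder should agree.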

The relationship between Unicode and UTF in Java programs can be described as follows, though not absolutely: while strings are in memory, they are held as Unicode codes, and when they are saved to a file or another medium, they are stored as UTF. This conversion is done by writeUTF and readUTF.
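A small round trip through `DataOutputStream.writeUTF` and `DataInputStream.readUTF` shows the stored form. The class and helper names below are hypothetical; note that `writeUTF` prefixes the encoded bytes with a two-byte length, which is why the array is two bytes longer than the UTF data itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteUtfSketch {
    // Serializes a string with writeUTF into an in-memory buffer.
    public static byte[] write(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the string back with readUTF.
    public static String read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = write("\u4e2d\u6587");
        StringBuilder hex = new StringBuilder();
        for (byte b : stored) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
        System.out.println(read(stored).equals("\u4e2d\u6587"));
    }
}
```

For the two characters U+4E2D U+6587, the stored bytes should be the length prefix 00 06 followed by E4 B8 AD E6 96 87.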

With the basics covered, it is time to get to the main topic.

Think of the problem first as a black box. Here is the first-level view of the black box:

input (charsetA) -> process (Unicode) -> output (charsetB)

Put simply, this is an IPO model: input, processing, and output. The same content goes through the conversion "from charsetA to Unicode to charsetB".

Now the second-level view:

SourceFile (jsp, java) -> class -> output

In this view, the input is the JSP and Java source files; during processing, the class file acts as the carrier; then comes the output. Refining further to the third level:

jsp -> temp file -> class -> browser, OS console, db

app, servlet -> class -> browser, OS console, db

The picture is now clearer. A JSP file is first turned into an intermediate Java file and then compiled into a class. Servlets and ordinary applications are compiled directly into classes. Then the class outputs to the browser, the console, or the database.

JSP: from source file to class

A JSP source file is a text file ending in ".jsp". This section explains how a JSP file is interpreted and compiled, and tracks what happens to the Chinese text along the way.

1. The JSP conversion tool (jspc) provided by the JSP/servlet engine searches the JSP file for the charset specified in <%@ page contentType="text/html; charset=<Jsp-charset>" %>. If <Jsp-charset> is not specified in the JSP file, the JVM's default setting file.encoding is used, which is typically ISO8859-1;

2. jspc uses the equivalent of the "javac -encoding <Jsp-charset>" command to interpret every character in the JSP file, both Chinese and ASCII, then converts these characters to Unicode characters and then to UTF format, and saves the result as a Java file. An ASCII character is converted to Unicode simply by adding "00" in front, so "A" becomes "\u0041" (no special reason; that is simply how the Unicode code table was laid out). After conversion to UTF, it becomes "41" again! This is why you can view the Java files generated from a JSP with an ordinary text editor;

3. The engine compiles the Java file into a class file with the equivalent of the "javac -encoding UNICODE" command.

Let's look at how Chinese characters are converted in these steps. Take the following source code:

<%@ page contentType="text/html; charset=gb2312" %>
<%
String a = "中文";
out.println(a);
%>

This code was written in UltraEdit for Windows. After saving, the hexadecimal encoding of the two characters "中文" ("Chinese") is "D6 D0 CE C4" (GB2312 encoding). Looking them up in the code table, the Unicode codes of "中文" are "\u4e2d\u6587", which in UTF is "E4 B8 AD E6 96 87". Opening the Java file that the engine generated from the JSP file shows that "中文" has indeed been replaced by "E4 B8 AD E6 96 87"; looking at the class file compiled from that Java file shows the same result as in the Java file.

Now consider the case where the charset specified in the JSP is ISO-8859-1.

<%@ page contentType="text/html; charset=iso-8859-1" %>
<%
String a = "中文";
out.println(a);
%>

As before, the file was written in UltraEdit, and "中文" is again stored as the GB2312 encoding "D6 D0 CE C4". First, simulate the generation of the Java file and the class file: jspc interprets "中文" as ISO-8859-1 and maps it to Unicode. Because ISO-8859-1 is an 8-bit encoding for Latin-script languages, its mapping rule is to add "00" in front of each byte, so the mapped Unicode codes should be "\u00d6\u00d0\u00ce\u00c4", and after conversion to UTF they should be "C3 96 C3 90 C3 8E C3 84". Now open the files and check: in both the Java file and the class file, "中文" is indeed represented as "C3 96 C3 90 C3 8E C3 84".
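The simulation just described can be reproduced in a few lines of Java. This is only a sketch under stated assumptions: the names `JspCompileSketch` and `toClassFileBytes` are hypothetical, standard UTF-8 stands in for the UTF form stored in the class file (the two agree for these characters), and the GB2312 case in `main` assumes the JDK provides that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class JspCompileSketch {
    // Interprets the source bytes with the JSP charset, then stores the
    // resulting Unicode string as UTF-8, mimicking the translation steps.
    public static byte[] toClassFileBytes(byte[] sourceBytes, Charset jspCharset) {
        return new String(sourceBytes, jspCharset).getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The GB2312 bytes saved by the editor for the two characters.
        byte[] source = { (byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4 };
        System.out.println("charset=iso-8859-1: "
                + hex(toClassFileBytes(source, StandardCharsets.ISO_8859_1)));
        // Assumption: this JDK provides the GB2312 charset.
        System.out.println("charset=gb2312:     "
                + hex(toClassFileBytes(source, Charset.forName("GB2312"))));
    }
}
```

The first line should show the eight-byte ISO-8859-1 result given above; the second should show E4 B8 AD E6 96 87, the GB2312 case from the previous example.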

If <Jsp-charset> is not specified in the code above, that is, the first line is written as "<%@ page contentType="text/html" %>", jspc uses the file.encoding setting to interpret the JSP file. On Red Hat 6.2, the result is identical to specifying ISO-8859-1.

So far, the mapping of Chinese characters from the JSP file to the class file has been explained. In short: "from Jsp-charset to Unicode to UTF". The following table summarizes the process:

Table 2 Conversion of "中文" from JSP to class

Jsp-charset | In the JSP file | In the Java file | In the class file
GB2312 | D6 D0 CE C4 (GB2312) | From \u4e2d\u6587 (Unicode) to E4 B8 AD E6 96 87 (UTF) | E4 B8 AD E6 96 87 (UTF)
ISO-8859-1 | D6 D0 CE C4 (GB2312) | From \u00d6\u00d0\u00ce\u00c4 (Unicode) to C3 96 C3 90 C3 8E C3 84 (UTF) | C3 96 C3 90 C3 8E C3 84 (UTF)
None (default = file.encoding) | Same as ISO-8859-1 | Same as ISO-8859-1 | Same as ISO-8859-1

The next section discusses how a servlet goes from Java file to class file, and then explains how output travels from the class file to the client. The material is arranged this way because JSPs and servlets handle output in the same way.

Servlet: from source file to class

A servlet source file is a text file ending in ".java". This section discusses how a servlet is compiled and tracks what happens to the Chinese text.

Servlet source files are compiled with "javac". javac accepts the "-encoding <Compile-charset>" parameter, which means "interpret the servlet source file using the encoding specified in <Compile-charset>".

At compile time, all characters in the source file, both Chinese and ASCII, are interpreted using <Compile-charset>. The character constants are then converted to Unicode characters, and finally Unicode is converted to UTF.

In a servlet, there is one more place where the charset of the output stream is set. Typically, before outputting the result, you call the setContentType method of HttpServletResponse, which achieves the same effect as setting <Jsp-charset> in a JSP. This is called <Servlet-charset>.

Note that three variables have been mentioned so far: <Jsp-charset>, <Compile-charset>, and <Servlet-charset>. A JSP file involves only <Jsp-charset>, while <Compile-charset> and <Servlet-charset> concern only servlets.

Consider the following example:

import javax.servlet.*;
import javax.servlet.http.*;

public class TestServlet extends HttpServlet
{
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, java.io.IOException
    {
        // <Servlet-charset>: the encoding used for the output stream
        resp.setContentType("text/html; charset=GB2312");
        java.io.PrintWriter out = resp.getWriter();
        // String constant whose bytes in the class file depend on <Compile-charset>
        out.println("#中文#");
    }
}

This file was also written in UltraEdit for Windows, where the two characters "中文" are saved as "D6 D0 CE C4" (GB2312 encoding).

Now compile. The table below shows the hexadecimal code of "中文" in the class file for different values of <Compile-charset>. <Servlet-charset> plays no part in compilation; it affects only the output of the class file. In effect, <Servlet-charset> and <Compile-charset> together achieve the same effect as <Jsp-charset> in a JSP file, because <Jsp-charset> affects both compilation and the output of the class file.

Table 3 Conversion of "中文" from servlet source file to class

Compile-charset | In the servlet source file | In the class file | Equivalent Unicode code
GB2312 | D6 D0 CE C4 (GB2312) | E4 B8 AD E6 96 87 (UTF) | \u4e2d\u6587 (in Unicode = "中文")
ISO-8859-1 | D6 D0 CE C4 (GB2312) | C3 96 C3 90 C3 8E C3 84 (UTF) | \u00d6 \u00d0 \u00ce \u00c4 (a 00 added in front of each of D6 D0 CE C4)
None (default) | D6 D0 CE C4 (GB2312) | Same as ISO-8859-1 | Same as ISO-8859-1

Ordinary Java programs are compiled in exactly the same way as servlets.

Is the representation of Chinese in the class file clear by now? Good. Next, let's see how the class outputs Chinese.

Class: outputting strings

As noted above, strings are held in memory as Unicode codes. What such a Unicode code represents depends on which character set it was mapped from, that is, on its ancestry. It is like checked baggage: the boxes all look alike from the outside, and what a box holds depends on what the sender actually put in it.

Take the example above. Suppose you are given the Unicode codes "00D6 00D0 00CE 00C4". If you do no conversion and simply look them up in the Unicode code table, they are four characters (special characters at that). If you map them to "ISO8859-1", dropping the leading "00" gives "D6 D0 CE C4", four characters in the ASCII code table. If you map them to GB2312, the result is likely to be garbage, because GB2312 may or may not contain characters corresponding to codes such as 00D6. If there is no corresponding character, you get 0x3F, the question mark; if there is one, it is probably some special symbol, because codes as low as 00D6 are too small: real Chinese characters in Unicode start at 4E00.

As you can see, the same Unicode characters can be interpreted in different ways. Of course, only one of them is the result we expect. In the example above, "D6 D0 CE C4" is what we want: when "D6 D0 CE C4" is output to IE and viewed as "Simplified Chinese", the two characters "中文" appear clearly. (Of course, if you insist on viewing it as "Western European", nothing can be done and you will see nothing meaningful.) Why? Because "00D6 00D0 00CE 00C4" was originally converted from ISO8859-1.
Before the class outputs a string, the Unicode string is converted back into a byte stream according to a particular encoding, and the byte stream is then written out. This is equivalent to a "String.getBytes(???)" operation, where ??? stands for a character set.

If it is a servlet, this encoding is the one specified in the HttpServletResponse.setContentType() method, that is, the <Servlet-charset> defined above.

If it is a JSP, this encoding is the one specified in <%@ page contentType="" %>, that is, the <Jsp-charset> defined above.

If it is a Java program, this encoding is the one specified in file.encoding, which defaults to ISO8859-1.
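The output step can be imitated with `String.getBytes`. In this sketch the names `OutputEncodingSketch` and `encodeForOutput` are hypothetical, and decoding the bytes as GB2312 stands in for a browser set to "Simplified Chinese" (which assumes the JDK provides the GB2312 charset).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OutputEncodingSketch {
    // The effect of the output step: Unicode string -> bytes in the output charset.
    public static byte[] encodeForOutput(String inMemory, Charset outputCharset) {
        return inMemory.getBytes(outputCharset);
    }

    public static void main(String[] args) {
        // The in-memory string obtained when the source was read as ISO8859-1.
        String inMemory = "\u00d6\u00d0\u00ce\u00c4";
        byte[] wire = encodeForOutput(inMemory, StandardCharsets.ISO_8859_1);

        // Assumption: this JDK provides GB2312; decoding plays the browser's role.
        String seenByBrowser = new String(wire, Charset.forName("GB2312"));
        System.out.println("Browser sees the two Chinese characters: "
                + seenByBrowser.equals("\u4e2d\u6587"));

        // For a plain Java program the platform default applies instead.
        System.out.println("Platform default charset: " + Charset.defaultCharset());
    }
}
```

The last line prints whatever default the local JVM uses, which may well differ from the ISO8859-1 default mentioned above.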

When the output target is a browser

Take the popular browser IE as an example. IE supports many encodings. If IE receives the byte stream "D6 D0 CE C4", you can try viewing it with various encodings. You will find that "Simplified Chinese" gives the correct result, because "D6 D0 CE C4" is, after all, the Simplified Chinese encoding of the two characters "中文".

Now let's walk through the whole process.

JSP: the source file is a text file in GB2312 format and contains the two Chinese characters "中文"

If <Jsp-charset> is specified as GB2312, the conversion process is as shown in the following table.

Table 4 Conversion process when Jsp-charset = GB2312

No. | Step | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file into a temporary Java file, maps the string to Unicode according to GB2312, and writes it to the Java file in UTF format | E4 B8 AD E6 96 87
3 | Compile the temporary Java file into a class file | E4 B8 AD E6 96 87
4 | At run time, read the string from the class file with readUTF; in memory it is Unicode | 4E 2D 65 87 (in Unicode, 4E2D = 中, 6587 = 文)
5 | Convert Unicode to a byte stream according to Jsp-charset = GB2312 | D6 D0 CE C4
6 | Output the byte stream to IE and set IE's encoding to GB2312 (author's note: this information is carried in the HTTP header) | D6 D0 CE C4
7 | IE views the result as "Simplified Chinese" | "中文" (displayed correctly)

If <Jsp-charset> is specified as ISO8859-1, the conversion process is as shown in the following table.

Table 5 Conversion process when Jsp-charset = ISO8859-1

No. | Step | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file into a temporary Java file, maps the string to Unicode according to ISO8859-1, and writes it to the Java file in UTF format | C3 96 C3 90 C3 8E C3 84
3 | Compile the temporary Java file into a class file | C3 96 C3 90 C3 8E C3 84
4 | At run time, read the string from the class file with readUTF; in memory it is Unicode | D6 D0 CE C4 (nothing has changed!)
5 | Convert Unicode to a byte stream according to Jsp-charset = ISO8859-1 | D6 D0 CE C4
6 | Output the byte stream to IE and set IE's encoding to ISO8859-1 (author's note: this information is carried in the HTTP header) | D6 D0 CE C4
7 | IE views the result as "Western European" | Garbled: these are actually four ASCII characters, but because their codes are greater than 128 they display as nonsense
8 | Change IE's page encoding to "Simplified Chinese" | "中文" (displayed correctly)

Strange! Why is the text displayed correctly whether <Jsp-charset> is set to GB2312 or to ISO8859-1? Because steps 2 and 5 in Tables 4 and 5 are inverse operations that cancel each other out. When ISO8859-1 is specified, however, the extra step 8 is an inconvenience.
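The cancellation works for ISO8859-1 because that charset maps every one of the 256 byte values to a character and back. A minimal check, using the hypothetical names `Latin1RoundTrip` and `roundTripsAllBytes`:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Decodes all 256 byte values as ISO8859-1 and encodes them again,
    // returning true if the bytes come back unchanged.
    public static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1); // like step 2
        byte[] encoded = decoded.getBytes(StandardCharsets.ISO_8859_1); // like step 5
        return Arrays.equals(all, encoded);
    }

    public static void main(String[] args) {
        System.out.println("ISO8859-1 round trip is lossless: " + roundTripsAllBytes());
    }
}
```

Since no byte is lost, the GB2312 bytes written by the editor reach the browser intact, and only the declared encoding is wrong.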

Now consider the case where <Jsp-charset> is not specified.

Table 6 Conversion process when Jsp-charset is not specified

No. | Step | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file into a temporary Java file, maps the string to Unicode according to ISO8859-1, and writes it to the Java file in UTF format | C3 96 C3 90 C3 8E C3 84
3 | Compile the temporary Java file into a class file | C3 96 C3 90 C3 8E C3 84
4 | At run time, read the string from the class file with readUTF; in memory it is Unicode | D6 D0 CE C4
5 | Convert Unicode to a byte stream according to Jsp-charset = ISO8859-1 | D6 D0 CE C4
6 | Output the byte stream to IE | D6 D0 CE C4
7 | IE views the result using the page encoding in effect when the request was made | It depends. If that encoding is Simplified Chinese, the text is displayed correctly; otherwise, step 8 of Table 5 must be performed

Servlet: the source file is a Java file in GB2312 format and contains the two Chinese characters "中文"

If <Compile-charset> = GB2312 and <Servlet-charset> = GB2312:

Table 7 Conversion process when Compile-charset = Servlet-charset = GB2312

No. | Step | Result
1 | Write the servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | Compile the Java source file into a class file with javac -encoding GB2312 | E4 B8 AD E6 96 87 (UTF)
3 | At run time, read the string from the class file with readUTF; in memory it is Unicode | 4E 2D 65 87 (Unicode)
4 | Convert Unicode to a byte stream according to Servlet-charset = GB2312 | D6 D0 CE C4 (GB2312)
5 | Output the byte stream to IE and set IE's encoding to Servlet-charset = GB2312 | D6 D0 CE C4 (GB2312)
6 | IE views the result as "Simplified Chinese" | "中文" (displayed correctly)

If <Compile-charset> = ISO8859-1 and <Servlet-charset> = ISO8859-1:

Table 8 Conversion process when Compile-charset = Servlet-charset = ISO8859-1

No. | Step | Result
1 | Write the servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | Compile the Java source file into a class file with javac -encoding iso8859-1 | C3 96 C3 90 C3 8E C3 84 (UTF)
3 | At run time, read the string from the class file with readUTF; in memory it is Unicode | D6 D0 CE C4
4 | Convert Unicode to a byte stream according to Servlet-charset = ISO8859-1 | D6 D0 CE C4
5 | Output the byte stream to IE and set IE's encoding to Servlet-charset = ISO8859-1 | D6 D0 CE C4 (GB2312)
6 | IE views the result as "Western European" | Garbled (for the same reason as in Table 5)
7 | Change IE's page encoding to "Simplified Chinese" | "中文" (displayed correctly)

If Compile-charset or Servlet-charset is not specified, the default value is ISO8859-1.

When Compile-charset = Servlet-charset, steps 2 and 4 are inverse operations that cancel each other out, so the result can be displayed correctly. Readers can try the case where Compile-charset <> Servlet-charset; the result will certainly be incorrect.
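The matched and mismatched cases can be compared with a small sketch. The names `CharsetMismatchSketch` and `servletOutput` are hypothetical, and the GB2312 case in `main` assumes the JDK provides that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchSketch {
    // Decodes the source bytes with the compile charset (like step 2) and
    // re-encodes the string with the servlet charset (like step 4).
    public static byte[] servletOutput(byte[] sourceBytes, Charset compileCharset,
                                       Charset servletCharset) {
        return new String(sourceBytes, compileCharset).getBytes(servletCharset);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] source = { (byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4 };
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // Assumption: this JDK provides the GB2312 charset.
        Charset gb2312 = Charset.forName("GB2312");

        System.out.println("matched:    " + hex(servletOutput(source, latin1, latin1)));
        System.out.println("mismatched: " + hex(servletOutput(source, gb2312, latin1)));
    }
}
```

In the matched case the original four bytes should come back; in the mismatched case the two Chinese characters have no ISO8859-1 equivalent, so each should be replaced by 3F.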

When the output target is a database

Output to a database works on the same principle as output to a browser. This section gives only a servlet example; the JSP case is left for the reader to work out.

Suppose a servlet receives a Chinese string from the client (IE, Simplified Chinese), writes it to a database whose encoding is ISO8859-1, then reads the string back from the database and displays it to the client.

Table 9 Conversion process when the output target is a database (1)

No. | Step | Result | Domain
1 | Enter "中文" in IE | D6 D0 CE C4 | IE
2 | IE converts the string to UTF and sends it into the transport stream | E4 B8 AD E6 96 87 | IE
3 | The servlet receives the input stream and reads it with readUTF | 4E 2D 65 87 (Unicode) | Servlet
4 | In the servlet, the programmer must restore the string to a byte stream according to GB2312 | D6 D0 CE C4 | Servlet
5 | The programmer generates a new string according to the database's encoding, ISO8859-1 | D6 D0 CE C4 | Servlet
6 | Submit the newly generated string to JDBC | D6 D0 CE C4 | Servlet
7 | JDBC detects that the database encoding is ISO8859-1 | D6 D0 CE C4 | JDBC
8 | JDBC generates a byte stream from the received string according to ISO8859-1 | D6 D0 CE C4 | JDBC
9 | JDBC writes the byte stream to the database | D6 D0 CE C4 | JDBC
10 | Data storage is complete | D6 D0 CE C4 | Database
The following steps retrieve the data from the database
11 | JDBC reads the byte stream from the database | D6 D0 CE C4 | JDBC
12 | JDBC generates a string according to the database's character set, ISO8859-1, and passes it to the servlet | D6 D0 CE C4 (Unicode) | JDBC
13 | The servlet obtains the string | D6 D0 CE C4 (Unicode) | Servlet
14 | The programmer must restore the original byte stream according to the database's encoding, ISO8859-1 | D6 D0 CE C4 | Servlet
15 | The programmer must generate a new string according to the client's character set, GB2312 | 4E 2D 65 87 (Unicode) | Servlet
The servlet prepares to output the string to the client
16 | The servlet generates a byte stream according to <Servlet-charset> | D6 D0 CE C4 | Servlet
17 | The servlet outputs the byte stream to IE and, if <Servlet-charset> is specified, also sets IE's encoding to <Servlet-charset> | D6 D0 CE C4 | Servlet
18 | IE views the result using the specified or default encoding | "中文" (displayed correctly) | IE

A note on the table: steps 4 and 5 and steps 14 and 15 are the conversions that the programmer must perform. Steps 4 and 5 are really a single statement: new String(source.getBytes("GB2312"), "iso8859-1"). Steps 14 and 15 are likewise a single statement: new String(source.getBytes("iso8859-1"), "GB2312"). Dear reader, are you aware of each of these details when you write code like this?
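Wrapped in helper methods, the two statements might look like the sketch below. The class and method names are hypothetical, the database itself is left out, and the code assumes the JDK provides the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DbTranscodeSketch {
    // Assumption: this JDK provides the GB2312 charset.
    private static final Charset GB2312 = Charset.forName("GB2312");

    // Steps 4 and 5: prepare a client string for an ISO8859-1 database.
    public static String toDatabaseString(String fromClient) {
        return new String(fromClient.getBytes(GB2312), StandardCharsets.ISO_8859_1);
    }

    // Steps 14 and 15: restore the client string from what the database returns.
    public static String fromDatabaseString(String fromDatabase) {
        return new String(fromDatabase.getBytes(StandardCharsets.ISO_8859_1), GB2312);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";
        String stored = toDatabaseString(original);
        for (char c : stored.toCharArray()) {
            System.out.printf("%04X ", (int) c); // the codes handed to JDBC
        }
        System.out.println();
        System.out.println("restored correctly: "
                + fromDatabaseString(stored).equals(original));
    }
}
```

The intermediate string should consist of the four codes 00D6 00D0 00CE 00C4, and the second conversion should bring back the original two characters.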

As for the processes with other client and database encodings, and the process when the output target is the system console, readers are invited to work them out themselves. Once you understand the principles above, I believe you can easily write them out.

That concludes the discussion. The end point turns out to be the starting point: for programmers, almost nothing changes.

Because this is what we were told to do all along.

Here is a conclusion to finish with.

1. In a JSP file, specify contentType, with the charset value equal to the character set used by the client browser. String constants need no encoding conversion. String variables must be restorable, using the character set specified in contentType, to a byte stream that the client recognizes; put simply, "string variables are based on the <Jsp-charset> character set";

2. In a servlet, set the charset with HttpServletResponse.setContentType(), matching the client's encoding. For string constants, specify the encoding when compiling with javac; this encoding must be the same as the character set of the platform on which the source file was written, generally GB2312 or GBK. String variables, as in JSPs, must be "based on the <Servlet-charset> character set".
