Chinese characters are double-byte characters. "Double-byte" means that one character occupies two bytes (that is, 16 bits), called the high byte and the low byte. The mandatory national standard for Chinese character encoding is GB2312, and almost every program that can process Chinese supports it. GB2312 covers the level-1 and level-2 Chinese characters plus the symbols in the 9 symbol zones. Both the high byte and the low byte range from 0xa1 to 0xfe; the Chinese characters themselves occupy 0xb0a1 to 0xf7fe.
There is another encoding called GBK, but it is a specification, not a mandatory standard. GBK provides 20902 Chinese characters, is compatible with GB2312, and its encoding range is 0x8140 to 0xfefe. Every character in GBK can be mapped one-to-one to Unicode 2.0.
In the near future, China will adopt another standard: GB18030-2000 (GBK2K). It adds the scripts of ethnic minorities such as Tibetan and Mongolian and fundamentally solves the problem of insufficient code positions. Note that it is no longer fixed-length: the two-byte part is compatible with GBK, and the four-byte part holds the extended characters. In the four-byte part, the first and third bytes range from 0x81 to 0xfe, and the second and fourth bytes range from 0x30 to 0x39.
This article does not describe Unicode; if you are interested, see http://www.unicode.org/ for more information. Unicode has one key property: it includes the characters of all the world's writing systems. The character set of every region can therefore establish a mapping to Unicode, and Java uses this to convert between different character sets.
In the JDK, the encodings related to Chinese are as follows:
Table 1: Chinese-related encodings in the JDK
Encoding name | Description
ASCII | 7-bit; same as ascii7
ISO8859-1 | 8-bit; same as 8859_1, ISO-8859-1, ISO_8859-1, latin1, ...
GB2312-80 | 16-bit; same as gb2312, gb2312-1980, euc_cn, euccn, 1381, cp1381, 1383, cp1383, iso2022cn, iso2022cn_gb, ...
GBK | Same as ms936. Note: the name is case-sensitive
UTF8 | Same as UTF-8
GB18030 | Same as cp1392 and 1392. Few JDKs currently support it
In practice, the encodings you will deal with most often are GB2312 (GBK) and ISO8859-1.
Why do "?" characters appear?
As mentioned above, conversion between different character sets goes through Unicode. Given two character sets A and B, the conversion from A to B works as follows: first convert A to Unicode, then convert Unicode to B.
For example, GB2312 contains the Chinese character "李" (Li), encoded as "c0ee", and we want to convert it to ISO8859-1. The steps are: first convert "李" to Unicode, giving "674e", then convert "674e" to an ISO8859-1 character. This mapping cannot succeed, of course, because ISO8859-1 has no character corresponding to "674e".
When a mapping fails, the trouble starts. When converting from a character set to Unicode, if the bytes have no character in that set, the resulting Unicode code is "\ufffd" ("\u" denotes a Unicode code). When converting from Unicode to a character set, if that set has no corresponding character, the result is "0x3f" ("?"). This is where the "?" comes from.
For example, take the byte stream buf = "0x80 0x40 0xb0 0xa1" and perform new String(buf, "gb2312"). The result is "\ufffd\u554a", and println shows "?啊", because "0x80 0x40" is a character in GBK but not in GB2312.
As another example, take the string buf = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9" and perform new String(buf.getBytes("GBK")). The resulting bytes are "3fa8aca8a6463fa8b4": "\u00d6" has no corresponding character in GBK, so it becomes "3f"; "\u00ec" maps to "a8ac"; "\u00e9" maps to "a8a6"; "\u0046" maps to "46" (it is an ASCII character); "\u00bb" is not found and becomes "3f"; finally "\u00f9" maps to "a8b4". Running println on this string gives "?ìéF?ù". As you can see, the output is not all question marks, because GBK maps more than just Chinese characters to Unicode.
So when Chinese character transcoding goes wrong, the result is not always a question mark. Still, one kind of garbage is no better than another.
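The two failure modes described above can be reproduced in a few lines. This is a minimal sketch, not the article's own test code: the class name ReplacementDemo and its helpers are made up for illustration, and it uses US-ASCII and ISO-8859-1 because those charsets are present in every JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Bytes -> Unicode: bytes the charset cannot decode become U+FFFD.
    public static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    // Unicode -> bytes: characters the charset lacks become its replacement
    // byte, which is 0x3f ('?') for ISO-8859-1.
    public static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // Hypothetical helper: render bytes as lowercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0x80 is not a valid US-ASCII byte, so decoding yields U+FFFD.
        String bad = decode(new byte[] {(byte) 0x80}, StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(bad.charAt(0)));

        // The character Li (U+674E) does not exist in ISO-8859-1, so encoding yields 3f.
        System.out.println(hex(encode("\u674e", StandardCharsets.ISO_8859_1)));
    }
}
```

Running main should print fffd and then 3f, matching the two rules stated above.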
You may also ask: what happens if a character exists in the source character set but not in Unicode? I do not know, because I have no such character set to test with. What is certain is that such a source character set would be non-standard. In Java, an exception is thrown if this happens.
What is UTF?
UTF is short for Unicode Transformation Format. For 16-bit Unicode characters, it is defined as follows:
(1) If the top nine bits of the 16-bit Unicode character are 0, it is represented by one byte: the first bit is "0" and the remaining seven bits are the low seven bits of the original character. For example, "\u0034" (0000 0000 0011 0100) is represented as "34" (0011 0100), the same value as the source Unicode character;
(2) If the top five bits of the 16-bit Unicode character are 0, it is represented by two bytes. The first byte starts with "110", followed by the five bits that come after the five leading zeros of the source character; the second byte starts with "10", followed by the low six bits of the source character. For example, "\u025d" (0000 0010 0101 1101) is converted to "c99d" (1100 1001 1001 1101);
(3) If neither rule applies, the character is represented by three bytes. The first byte starts with "1110", followed by the high four bits of the source character; the second byte starts with "10", followed by the middle six bits; the third byte starts with "10", followed by the low six bits. For example, "\u9da7" (1001 1101 1010 0111) is converted to "e9b6a7" (1110 1001 1011 0110 1010 0111).
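The three rules can be written out directly as bit operations. The sketch below is illustrative only: the class Utf8Rules is a made-up name, it handles a single 16-bit char, and it ignores surrogate pairs (characters above U+FFFF), which the rules above do not cover.

```java
public class Utf8Rules {
    // Encode one 16-bit character according to the three rules above.
    public static byte[] encode(char c) {
        if (c <= 0x007F) {
            // Rule 1: top nine bits are 0 -> one byte, 0xxxxxxx.
            return new byte[] {(byte) c};
        } else if (c <= 0x07FF) {
            // Rule 2: top five bits are 0 -> two bytes, 110xxxxx 10xxxxxx.
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // Rule 3: everything else -> three bytes, 1110xxxx 10xxxxxx 10xxxxxx.
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    // Hypothetical helper: render bytes as lowercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode('\u0034'))); // 34
        System.out.println(hex(encode('\u025d'))); // c99d
        System.out.println(hex(encode('\u9da7'))); // e9b6a7
    }
}
```

The three printed values are the same ones used as examples in rules (1) to (3).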
The relationship between Unicode and UTF in a Java program can be described as follows, although not absolutely: while a string lives in memory it is represented as Unicode codes, and when it is saved to a file or another medium it is stored as UTF. The conversion is done by writeUTF and readUTF.
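To see writeUTF and readUTF at work, the following sketch writes a string into an in-memory buffer and reads it back. The class name WriteUtfDemo and the helper methods are assumptions made for this illustration; DataOutputStream.writeUTF and DataInputStream.readUTF are the standard JDK methods. Note that writeUTF prefixes the encoded bytes with a two-byte length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteUtfDemo {
    // Serialize a string with writeUTF: 2-byte length, then the UTF bytes.
    public static byte[] write(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Read the string back with readUTF.
    public static String read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: render bytes as lowercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = write("\u4e2d\u6587");
        System.out.println(hex(data));   // 0006e4b8ade69687
        System.out.println(read(data).equals("\u4e2d\u6587")); // true
    }
}
```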
That covers the basics. Now to the main topic.
Treat the problem as a black box first. Here is the first-level view of the black box:
input (charsetA) -> process (Unicode) -> output (charsetB)
Simple enough: this is an IPO model (input, process, output). The same content is converted from charsetA to Unicode and then to charsetB.
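In Java the IPO model collapses into two calls: decode the input bytes into a String (Unicode), then encode the String into the target charset. The sketch below assumes a JDK that ships the GB2312 charset (full JDKs normally do, minimal runtimes may not); the class name Transcoder is invented for this example.

```java
import java.nio.charset.Charset;

public class Transcoder {
    // input (charsetA) -> process (Unicode String) -> output (charsetB)
    public static byte[] transcode(byte[] input, String charsetA, String charsetB) {
        String unicode = new String(input, Charset.forName(charsetA));
        return unicode.getBytes(Charset.forName(charsetB));
    }

    // Hypothetical helper: render bytes as lowercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters of the Chinese word for "Chinese", in GB2312.
        byte[] gb = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(hex(transcode(gb, "GB2312", "UTF-8"))); // e4b8ade69687
    }
}
```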
Now the second-level view:
source file (JSP, Java) -> class -> output
Here the input is the JSP and Java source files, the class file is the carrier during processing, and then comes the output. Refined to a third level:
JSP -> temp file -> class -> browser, OS console, DB
app, servlet -> class -> browser, OS console, DB
This picture is clearer. A JSP file is first turned into an intermediate Java file, from which the class is generated. Servlets and ordinary applications are compiled directly into classes. The class then outputs data to the browser, the console, or a database.
JSP: process from source file to class
A JSP source file is a text file ending in ".jsp". This section describes how a JSP file is interpreted and compiled, and tracks what happens to the Chinese characters along the way.
1. The JSP conversion tool (jspc) provided by the JSP/servlet engine looks for the charset specified in <%@ page contentType="text/html; charset=<jsp-charset>" %>. If the JSP file does not specify <jsp-charset>, jspc uses the JVM's default setting file.encoding, which is typically ISO8859-1;
2. jspc interprets every character in the JSP file, Chinese and ASCII alike, with the equivalent of "javac -encoding <jsp-charset>", converts the characters to Unicode, converts them again to UTF, and saves the result as a Java file. Converting an ASCII character to Unicode simply prepends "00": "A" becomes "\u0041" (no reason needed; that is how the Unicode code table is laid out). After conversion to UTF it turns back into "41". This is why the Java files generated from JSP can be read in an ordinary text editor;
3. The engine compiles the Java file into a class file with the equivalent of "javac -encoding UNICODE";
Let's see what happens to Chinese characters during these steps. Take the following source code:
<%@ page contentType="text/html; charset=gb2312" %>
<html><body>
<%
// "中文" is the Chinese word for "Chinese"
String a = "中文";
out.println(a);
%>
</body></html>
This code was written in UltraEdit for Windows. After saving, the hexadecimal codes of the two Chinese characters "中文" ("Chinese") are "D6 D0 CE C4" (GB2312 encoding). According to the code table, the Unicode codes of "中文" are "\u4e2d\u6587", which is "E4 B8 AD E6 96 87" in UTF. Open the Java file that the engine generated from the JSP file and you will find that "中文" has indeed been replaced by "E4 B8 AD E6 96 87". Check the class file compiled from that Java file and the result is exactly the same.
Now consider the case where the charset specified in the JSP is ISO-8859-1.
<%@ page contentType="text/html; charset=ISO-8859-1" %>
<html><body>
<%
// "中文" is the Chinese word for "Chinese"
String a = "中文";
out.println(a);
%>
</body></html>
This file was also written in UltraEdit, so "中文" ("Chinese") is again stored as the GB2312 codes "D6 D0 CE C4". Let's first simulate how the Java file and the class file are generated: jspc interprets "中文" as ISO-8859-1 and maps it to Unicode. Because ISO-8859-1 is an 8-bit Latin character set, the mapping rule is to prepend "00" to each byte, so the Unicode codes after mapping should be "\u00d6\u00d0\u00ce\u00c4", and after conversion to UTF they should be "C3 96 C3 90 C3 8E C3 84". Now open the files: "中文" in the Java file and the class file is indeed "C3 96 C3 90 C3 8E C3 84".
If the code above does not specify <jsp-charset>, that is, the first line is written as "<%@ page contentType="text/html" %>", jspc uses the file.encoding setting to interpret the JSP file. On RedHat 6.2, the result is exactly the same as when ISO-8859-1 is specified.
This completes the mapping of Chinese characters on the way from a JSP file to a class file. In one sentence: from <jsp-charset> to Unicode to UTF. The following table summarizes the process:
Table 2: Conversion process from JSP to class
jsp-charset | In the JSP file | In the Java file | In the class file
GB2312 | D6 D0 CE C4 (GB2312) | from \u4e2d\u6587 (Unicode) to E4 B8 AD E6 96 87 (UTF) | E4 B8 AD E6 96 87 (UTF)
ISO-8859-1 | D6 D0 CE C4 (GB2312) | from \u00d6\u00d0\u00ce\u00c4 (Unicode) to C3 96 C3 90 C3 8E C3 84 (UTF) | C3 96 C3 90 C3 8E C3 84 (UTF)
None (default = file.encoding) | Same as ISO-8859-1 | Same as ISO-8859-1 | Same as ISO-8859-1
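The rows of this table can be reproduced without a JSP engine by decoding the source bytes with the chosen charset and re-encoding the result as UTF-8. This is only a simulation of the "jsp-charset to Unicode to UTF" path under the assumption that the engine behaves as described; the class JspCompileSim is hypothetical, and the GB2312 case requires a JDK that provides that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class JspCompileSim {
    // Simulate: source bytes --(jsp-charset)--> Unicode --> UTF bytes in the class file.
    public static String classFileHex(byte[] sourceBytes, String jspCharset) {
        String unicode = new String(sourceBytes, Charset.forName(jspCharset));
        byte[] utf = unicode.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The GB2312 bytes saved by the editor.
        byte[] source = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(classFileHex(source, "GB2312"));     // e4b8ade69687
        System.out.println(classFileHex(source, "ISO-8859-1")); // c396c390c38ec384
    }
}
```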
Next we discuss how a servlet goes from Java file to class file, and after that how a class file outputs to the client. The reason for this order is that JSP and servlets handle output in the same way.
Servlet: process from source file to class
A servlet source file is a text file ending in ".java". This section describes the servlet compilation process and tracks what happens to the Chinese characters.
The servlet source file is compiled with "javac". javac accepts the parameter "-encoding <compile-charset>", which means "interpret the servlet source file using the encoding specified in <compile-charset>".
During compilation, <compile-charset> is used to interpret every character, Chinese and ASCII alike. The string constants are then converted to Unicode characters, and finally from Unicode to UTF.
A servlet has one more place where a charset is set: the output stream. Before producing output, the servlet usually calls the setContentType method of HttpServletResponse, which has the same effect as setting <jsp-charset> in a JSP. We call this <servlet-charset>.
Note that three variables have now been introduced: <jsp-charset>, <compile-charset>, and <servlet-charset>. A JSP file is affected only by <jsp-charset>, whereas <compile-charset> and <servlet-charset> apply only to servlets.
Consider the following example:
import javax.servlet.*;
import javax.servlet.http.*;

public class TestServlet extends HttpServlet
{
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, java.io.IOException
    {
        // <servlet-charset>: the charset used to encode the response
        resp.setContentType("text/html; charset=GB2312");
        java.io.PrintWriter out = resp.getWriter();
        out.println("<html>");
        // "中文" is the Chinese word for "Chinese"
        out.println("#中文#");
        out.println("</html>");
    }
}
This file was also written in UltraEdit for Windows, and "中文" ("Chinese") is stored as "D6 D0 CE C4" (GB2312 encoding).
Now compile. The following table lists the hexadecimal codes of "中文" in the class file for different values of <compile-charset>. <servlet-charset> plays no role during compilation; it only affects the output of the class file. In effect, <servlet-charset> and <compile-charset> together achieve what <jsp-charset> does alone in a JSP file, because <jsp-charset> affects both compilation and the output of the class file.
Table 3: Transformation of "中文" ("Chinese") from servlet source file to class
compile-charset | In the servlet source file | In the class file | Equivalent Unicode codes
GB2312 | D6 D0 CE C4 (GB2312) | E4 B8 AD E6 96 87 (UTF) | \u4e2d\u6587 ("中文" in Unicode)
ISO-8859-1 | D6 D0 CE C4 (GB2312) | C3 96 C3 90 C3 8E C3 84 (UTF) | \u00d6\u00d0\u00ce\u00c4 (a 00 is prepended to each of D6 D0 CE C4)
None (default) | D6 D0 CE C4 (GB2312) | Same as ISO-8859-1 | Same as ISO-8859-1
An ordinary Java program is compiled in exactly the same way as a servlet.
Is the representation of Chinese in the class file clear by now? Good. Next, let's see how a class outputs Chinese.
Class: Output string
As mentioned above, strings are held in memory as Unicode codes. What those Unicode codes actually represent depends on which character set they were mapped from, that is, on their ancestry. It is like checked luggage: every piece looks like a cardboard box, and what is inside depends on what the sender actually packed.
Take the example above. Given the Unicode codes "00d6 00d0 00ce 00c4", if you do no conversion and simply look them up in the Unicode code table, they are four characters (special ones at that). If you map them to ISO8859-1, you just drop the leading "00" and get "D6 D0 CE C4", four single-byte characters. If you map them to GB2312, the result is likely garbage, because GB2312 may or may not have characters corresponding to codes such as 00d6. If it does not, you get 0x3f, the question mark; if it does, codes as low as 00d6 can only be special characters, because real Chinese characters in Unicode start at 4e00.
As you can see, the same Unicode codes can be interpreted in different ways, and only one of them is the result we want. In the example above, "D6 D0 CE C4" is what we want: when "D6 D0 CE C4" is output to IE and viewed as Simplified Chinese, the characters "中文" ("Chinese") appear clearly. (If you insist on viewing it as Western European, nothing can be done; you will not get anything meaningful.) Why? Because "00d6 00d0 00ce 00c4" was converted from ISO8859-1 in the first place.
Hence the following conclusion:
Before a class outputs a string, it regenerates a byte stream from the Unicode string according to some internal code (charset) and then outputs that byte stream. This amounts to a "String.getBytes(???)" operation, where ??? stands for a character set.
For a servlet, this charset is the one specified in the HttpServletResponse.setContentType() method, that is, the <servlet-charset> defined above.
For a JSP, this charset is the one specified in <%@ page contentType="" %>, that is, the <jsp-charset> defined above.
For a Java program, this charset is the one specified by file.encoding, which defaults to ISO8859-1.
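The hidden getBytes step is easiest to observe with an OutputStreamWriter, which performs the same Unicode-to-bytes conversion when text is written to a stream. The sketch below is an illustration under assumptions, not the servlet engine's actual code: OutputDemo is a made-up class, the in-memory buffer stands in for the browser connection, and the GB2312 case needs a JDK that provides that charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class OutputDemo {
    // Write a Unicode string to a byte sink using the given output charset.
    public static byte[] output(String s, String charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, charset)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Hypothetical helper: render bytes as lowercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode 4E2D 6587 written as GB2312.
        System.out.println(hex(output("\u4e2d\u6587", "GB2312")));                 // d6d0cec4
        // Unicode 00D6 00D0 00CE 00C4 written as ISO-8859-1.
        System.out.println(hex(output("\u00d6\u00d0\u00ce\u00c4", "ISO-8859-1"))); // d6d0cec4
    }
}
```

Both calls produce the same four bytes, which is why the browser can display the text correctly in either case once it interprets the bytes as Simplified Chinese.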
When the output object is a browser
Take the popular browser IE as an example. IE supports multiple charsets. If IE receives the byte stream "D6 D0 CE C4", you can try viewing it with various encodings, and you will find that "Simplified Chinese" gives the correct result, because "D6 D0 CE C4" is the Simplified Chinese encoding of "中文" ("Chinese") in the first place.
OK, let's go through the whole process in detail.
JSP: the source file is a text file in GB2312 format, and it contains the two Chinese characters "中文" ("Chinese").
If <jsp-charset> is set to GB2312, the conversion proceeds as follows.
Table 4: Conversion process when jsp-charset = GB2312
Step | Description | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file to a temporary Java file, maps the string to Unicode according to GB2312, and writes it to the Java file in UTF format | E4 B8 AD E6 96 87
3 | The temporary Java file is compiled into a class file | E4 B8 AD E6 96 87
4 | At run time, the string is read from the class file with readUTF; in memory it is Unicode | 4E 2D 65 87 (in Unicode, 4E2D = 中, 6587 = 文)
5 | The Unicode is converted to a byte stream according to jsp-charset = GB2312 | D6 D0 CE C4
6 | The byte stream is output to IE, and IE's encoding is set to GB2312 (author's note: this information is carried in the HTTP header) | D6 D0 CE C4
7 | IE views the result as "Simplified Chinese" | "中文" (displayed correctly)
If <jsp-charset> is set to ISO8859-1, the conversion proceeds as follows.
Table 5: Conversion process when jsp-charset = ISO8859-1
Step | Description | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file to a temporary Java file, maps the string to Unicode according to ISO8859-1, and writes it to the Java file in UTF format | C3 96 C3 90 C3 8E C3 84
3 | The temporary Java file is compiled into a class file | C3 96 C3 90 C3 8E C3 84
4 | At run time, the string is read from the class file with readUTF; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4 (meaningless!)
5 | The Unicode is converted to a byte stream according to jsp-charset = ISO8859-1 | D6 D0 CE C4
6 | The byte stream is output to IE, and IE's encoding is set to ISO8859-1 (author's note: this information is carried in the HTTP header) | D6 D0 CE C4
7 | IE views the result as "Western European" | Garbage: actually four single-byte characters, but because they are greater than 128 they display as odd symbols
8 | Change IE's page encoding to "Simplified Chinese" | "中文" (displayed correctly)
Strange! Why is the result the same, and displayed correctly, whether <jsp-charset> is set to GB2312 or ISO8859-1? Because steps 2 and 5 in Table 4 and Table 5 cancel each other out. Specifying ISO8859-1 merely adds step 8, which is an inconvenience.
Now consider the case where <jsp-charset> is not specified.
Table 6: Conversion process when jsp-charset is not specified
Step | Description | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file to a temporary Java file, maps the string to Unicode according to ISO8859-1, and writes it to the Java file in UTF format | C3 96 C3 90 C3 8E C3 84
3 | The temporary Java file is compiled into a class file | C3 96 C3 90 C3 8E C3 84
4 | At run time, the string is read from the class file with readUTF; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4
5 | The Unicode is converted to a byte stream according to jsp-charset = ISO8859-1 (the default) | D6 D0 CE C4
6 | The byte stream is output to IE | D6 D0 CE C4
7 | IE views the result with the page encoding that was in effect when the request was sent | Depends: if that encoding is Simplified Chinese, the text displays correctly; otherwise step 8 of Table 5 is needed
Servlet: the source file is a Java file in GB2312 format, and it contains the two Chinese characters "中文" ("Chinese").
If <compile-charset> = GB2312 and <servlet-charset> = GB2312:
Table 7: Conversion process when compile-charset = servlet-charset = GB2312
Step | Description | Result
1 | Write the servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | Compile the Java source file into a class file with javac -encoding GB2312 | E4 B8 AD E6 96 87 (UTF)
3 | At run time, the string is read from the class file with readUTF; in memory it is Unicode | 4E 2D 65 87 (Unicode)
4 | The Unicode is converted to a byte stream according to servlet-charset = GB2312 | D6 D0 CE C4 (GB2312)
5 | The byte stream is output to IE, and IE's encoding is set to servlet-charset = GB2312 | D6 D0 CE C4 (GB2312)
6 | IE views the result as "Simplified Chinese" | "中文" (displayed correctly)
If <compile-charset> = ISO8859-1 and <servlet-charset> = ISO8859-1:
Table 8: Conversion process when compile-charset = servlet-charset = ISO8859-1
Step | Description | Result
1 | Write the servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | Compile the Java source file into a class file with javac -encoding ISO8859-1 | C3 96 C3 90 C3 8E C3 84 (UTF)
3 | At run time, the string is read from the class file with readUTF; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4
4 | The Unicode is converted to a byte stream according to servlet-charset = ISO8859-1 | D6 D0 CE C4
5 | The byte stream is output to IE, and IE's encoding is set to servlet-charset = ISO8859-1 | D6 D0 CE C4 (GB2312)
6 | IE views the result as "Western European" | Garbage (for the same reason as in Table 5)
7 | Change IE's page encoding to "Simplified Chinese" | "中文" (displayed correctly)
If neither compile-charset nor servlet-charset is specified, both default to ISO8859-1.
When compile-charset = servlet-charset, the compilation step and the output step (steps 2 and 4 in the tables above) are inverses of each other and cancel out, so the result is correct. Try making compile-charset differ from servlet-charset and the result will certainly be wrong.
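The cancelling effect, and what happens when the two charsets differ, can be tried without a servlet container. The following is a simulation under the assumption that compilation decodes the source bytes with compile-charset and output encodes them with servlet-charset, as described above; the class CharsetMismatch is hypothetical, and the GB2312 cases need a JDK that provides that charset.

```java
import java.nio.charset.Charset;

public class CharsetMismatch {
    // Decode with compile-charset (compilation), encode with servlet-charset (output).
    public static byte[] run(byte[] sourceBytes, String compileCharset, String servletCharset) {
        String inMemory = new String(sourceBytes, Charset.forName(compileCharset));
        return inMemory.getBytes(Charset.forName(servletCharset));
    }

    // Hypothetical helper: render bytes as lowercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The GB2312 bytes stored in the servlet source file.
        byte[] source = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        // Matching charsets: the original bytes come back out.
        System.out.println(hex(run(source, "GB2312", "GB2312")));         // d6d0cec4
        System.out.println(hex(run(source, "ISO-8859-1", "ISO-8859-1"))); // d6d0cec4

        // Mismatch: two Chinese characters cannot be encoded as ISO-8859-1,
        // so each one turns into 0x3f, the question mark.
        System.out.println(hex(run(source, "GB2312", "ISO-8859-1")));     // 3f3f
    }
}
```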
When the output object is a database
Output to a database works on the same principle as output to a browser. This section uses a servlet as the example; the JSP case is left for you to work out.
Suppose a servlet receives a Chinese string from the client (IE, Simplified Chinese), writes it to a database whose internal code is ISO8859-1, then reads the string back from the database and displays it to the client.
Table 9: Conversion process when the output target is a database (1)
Step | Description | Result | Where
1 | "中文" ("Chinese") is typed into IE | D6 D0 CE C4 | IE
2 | IE converts the string to UTF and sends it into the transport stream | E4 B8 AD E6 96 87 | IE
3 | The servlet receives the input stream and reads it with readUTF | 4E 2D 65 87 (Unicode) | Servlet
4 | The programmer must restore the string to a byte stream according to GB2312 | D6 D0 CE C4 | Servlet
5 | The programmer builds a new string according to the database's internal code, ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 | Servlet
6 | The new string is submitted to JDBC | 00 D6 00 D0 00 CE 00 C4 | Servlet
7 | JDBC detects that the database's internal code is ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 | JDBC
8 | JDBC generates a byte stream from the received string according to ISO8859-1 | D6 D0 CE C4 | JDBC
9 | JDBC writes the byte stream to the database | D6 D0 CE C4 | JDBC
10 | Data storage is complete | D6 D0 CE C4 | Database
Reading the data back from the database:
11 | JDBC reads the byte stream from the database | D6 D0 CE C4 | JDBC
12 | JDBC builds a string according to the database character set ISO8859-1 and submits it to the servlet | 00 D6 00 D0 00 CE 00 C4 (Unicode) | JDBC
13 | The servlet obtains the string | 00 D6 00 D0 00 CE 00 C4 (Unicode) | Servlet
14 | The programmer must restore the original byte stream according to the database's internal code, ISO8859-1 | D6 D0 CE C4 | Servlet
15 | The programmer must build a new string according to the client character set, GB2312 | 4E 2D 65 87 (Unicode) | Servlet
The servlet is now ready to output the string to the client:
16 | The servlet generates a byte stream according to <servlet-charset> | D6 D0 CE C4 | Servlet
17 | The servlet outputs the byte stream to IE; if <servlet-charset> is specified, it also sets IE's encoding to <servlet-charset> | D6 D0 CE C4 | Servlet
18 | IE views the result according to the specified or default encoding | "中文" (displayed correctly) | IE
A note on the table: steps 4 and 5, and likewise steps 14 and 15, are conversions that the programmer must code by hand. Steps 4 and 5 amount to one statement: "new String(source.getBytes("GB2312"), "ISO8859-1")". Steps 14 and 15 also amount to one statement: "new String(source.getBytes("ISO8859-1"), "GB2312")". Dear reader, were you aware of every one of these details when writing such code?
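The two hand-written conversions can be checked in isolation. The sketch below wraps each statement in a method; the class DbCharsetBridge and the method names toDatabase and fromDatabase are invented for this illustration, and no real database or JDBC driver is involved. It assumes a JDK that provides the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DbCharsetBridge {
    private static final Charset GB2312 = Charset.forName("GB2312");

    // Steps 4 and 5: client string -> GB2312 bytes -> string "in database charset".
    public static String toDatabase(String clientString) {
        return new String(clientString.getBytes(GB2312), StandardCharsets.ISO_8859_1);
    }

    // Steps 14 and 15: database string -> original bytes -> client string.
    public static String fromDatabase(String dbString) {
        return new String(dbString.getBytes(StandardCharsets.ISO_8859_1), GB2312);
    }

    public static void main(String[] args) {
        String client = "\u4e2d\u6587";           // the two Chinese characters
        String stored = toDatabase(client);       // 00D6 00D0 00CE 00C4 in memory
        System.out.println(stored.equals("\u00d6\u00d0\u00ce\u00c4")); // true
        System.out.println(fromDatabase(stored).equals(client));       // true
    }
}
```

The round trip only works because ISO8859-1 maps every byte value to a character, so no information is lost between the two conversions.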
The cases where the client charset and the database charset take other values, and where the output target is the system console, are left for you to think through. Once you understand the process above, I believe you can work them out easily.
That brings us to the end. The end point turns out to be the starting point: for programmers, almost nothing changes.
Because we were told to do exactly this long ago.
Here are the conclusions.
1. In a JSP file, you must specify contentType, and its charset value must match the character set used by the client browser. String constants need no charset conversion; string variables must be restorable to a byte stream the client can recognize according to the character set specified in contentType. Put simply, string variables must be "based on the <jsp-charset> character set";
2. In a servlet, you must set the charset with HttpServletResponse.setContentType(), and it must match the client charset. For string constants, you must specify the encoding when compiling with javac, and it must match the character set of the platform on which the source file was written, generally GB2312 or GBK. String variables, as in JSP, must be "based on the <servlet-charset> character set".
This article from csdn blog: http://blog.csdn.net/happyxyzw/archive/2005/09/10/477024.aspx