An in-depth discussion on Chinese character problems

Source: Internet
Author: User
I. Topic: Chinese character problems in Java
Chinese character problems are prominent in Java and show up mainly in console output, JSP page output, and database access. This article sets aside font problems and deals only with encoding. It explains where Java's Chinese character problems come from and how to solve them, and gives an approach for accessing a database with JDBC.

II. Problem description:
1. Compiled and run in the Chinese window of Chinese W2000, with the international version of the JDK, connected to a CP936-encoded SQL Server database on Chinese W2000:

J:\exercise\demo\encode\helloworld>make
Created by Xcompiler. Philosoft all Rights Reserved.
Wed May 02:54:45 CST 2001

J:\exercise\demo\encode\helloworld>run
Created by Xrunner. Philosoft all Rights Reserved.
Wed May 02:51:33 CST 2001
中文
[B@7bc8b569
[B@7b08b569
[B@7860b569
中文
中文
????
中文
中文
????
??
??
??

2. If the program is compiled in the English-language window (code page 437) of Chinese W2000 and run there with java, nothing can be displayed because that window has no Chinese font. If the class is instead run in the Chinese window of Chinese W2000, as above, the output is:

J:\exercise\demo\encode\helloworld>run
Created by Xrunner. Philosoft all Rights Reserved.
Wed May 02:51:33 CST 2001
????
[B@7bc0b66a
[B@7b04b66a
[B@7818b66a
????
????
????
????
????
????
中文
中文
????

III. Analysis

1. Garbled characters appear (that is, "?"). Because only "?" appears rather than a small box, the problem lies purely in the encoding and not in the font. In encoding terms, when text is converted from a larger character set to a smaller one, typically from GB2312 to ISO8859_1 (Latin-1, a single-byte superset of ASCII), many Chinese characters (or halves of Chinese characters) cannot be mapped to Western characters, and the system replaces them with "?". Similarly, some characters in a small character set cannot be mapped to a large character set; the specific reason is not known here.
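The "?" substitution can be reproduced in isolation. The following is a minimal sketch, not part of the original program: the class name QuestionMarkDemo and the helper toLatin1 are made up for illustration, and the Chinese text is written with \u escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {
    // Encodes text as ISO-8859-1. Characters that do not exist in
    // Latin-1 are replaced by the charset's replacement byte, '?' (0x3F).
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // the two characters "中文"
        byte[] bytes = toLatin1(text);
        System.out.println(bytes.length); // one byte per unmappable character
        // Decoding those bytes again yields plain ASCII question marks.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

Once the characters have been replaced, the original text cannot be recovered from the bytes, which is why this kind of conversion is lossy.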

2. When the program is compiled and run in the Chinese environment, Chinese characters display correctly in some places and incorrectly in others; the same happens when it is compiled in the Western-language environment and run in the Chinese environment. This is the result of automatic (default) or manual transcoding, through new String(bytes[, encode]) and byte[] getBytes([encode]).

2.1) In the chain Java source file --> javac --> class --> java --> getBytes() --> new String() --> display, every step involves an encoding conversion. The conversion always takes place, although sometimes with default parameters. The following is a step-by-step analysis of why the situations above occur.

2.2) Here is the source code:

HelloWorld.java:
------------------------
public class HelloWorld
{
    public static void main(String[] argv) {
        try {
            // Print the literal, then the byte arrays (shown as [B@hash)
            System.out.println("中文"); //1
            System.out.println("中文".getBytes()); //2
            System.out.println("中文".getBytes("GB2312")); //3
            System.out.println("中文".getBytes("ISO8859_1")); //4

            // Encode with the default encoding, decode three ways
            System.out.println(new String("中文".getBytes())); //5
            System.out.println(new String("中文".getBytes(), "GB2312")); //6
            System.out.println(new String("中文".getBytes(), "ISO8859_1")); //7

            // Encode as GB2312, decode three ways
            System.out.println(new String("中文".getBytes("GB2312"))); //8
            System.out.println(new String("中文".getBytes("GB2312"), "GB2312")); //9
            System.out.println(new String("中文".getBytes("GB2312"), "ISO8859_1")); //10

            // Encode as ISO8859_1, decode three ways
            System.out.println(new String("中文".getBytes("ISO8859_1"))); //11
            System.out.println(new String("中文".getBytes("ISO8859_1"), "GB2312")); //12
            System.out.println(new String("中文".getBytes("ISO8859_1"), "ISO8859_1")); //13
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

For convenience, a sequence number (1, 2, ..., 13) is appended as a comment to each conversion.

2.3) It should be explained that javac reads the source file in the system default encoding and converts it to Unicode. Java also works in Unicode while it is running, and its default input and output use the operating system's default encoding. In new String(bytes[, encode]), the system assumes that the input is a byte stream encoded as encode. In other words, if decoding bytes as encode gives the correct characters, that result is converted from encode to Unicode and stored in Java, which means the conversion is bytes --> encode characters --> Unicode characters. In String.getBytes([encode]), the system performs the reverse conversion: Unicode characters --> encode characters --> bytes.
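The two directions can be made visible by printing the bytes in hexadecimal. This is only a sketch under stated assumptions: the class TranscodeDirections and its helpers are illustrative names, and it assumes the running JDK provides the GBK charset (standard JDK builds normally do).

```java
import java.nio.charset.Charset;

class TranscodeDirections {
    // Assumes the JDK ships the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Unicode characters --> bytes in the given encoding, shown as hex.
    static String encodeToHex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Bytes in the given encoding --> Unicode characters.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the character "中"
        System.out.println(encodeToHex(zhong, GBK)); // D6D0
        String back = decode(new byte[] {(byte) 0xD6, (byte) 0xD0}, GBK);
        System.out.println(Integer.toHexString(back.charAt(0))); // 4e2d
    }
}
```

Inside the JVM the character is always the Unicode code unit U+4E2D; the two bytes D6 D0 exist only on the encoded side of the conversion.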

In this example the default encoding is GBK in every case except compilation in the English window (here we treat GBK and GB2312 as equivalent).

2.4) In the two conversions above, the system uses the default encoding (GBK here) whenever encode is not specified. We therefore consider 5, 6, 7 to be the same as 8, 9, 10, with 8 the same as 9 and 11 the same as 12, so we discuss only 1, 9, 10, 12, and 13. Cases 2, 3, and 4 are there only for testing and are outside the discussion.
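The claim that an omitted encode falls back to the default encoding can be checked directly. A small sketch follows; DefaultEncodingDemo and sameAsDefault are hypothetical names. Note that the default charset depends on the JDK version and the platform, so it is GBK only on a system configured like the one in this article.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultEncodingDemo {
    // True when getBytes() without an argument produces the same bytes
    // as getBytes() with the JVM's default charset passed explicitly.
    static boolean sameAsDefault(String text) {
        byte[] implicit = text.getBytes();
        byte[] explicit = text.getBytes(Charset.defaultCharset());
        return Arrays.equals(implicit, explicit);
    }

    public static void main(String[] args) {
        // On the Chinese W2000 system described here this would be GBK.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(sameAsDefault("\u4e2d\u6587"));
    }
}
```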

2.5) Below we trace what happens to the character "中" in the program, starting with the case where the program is compiled and run in the Chinese window. Note that the numbers attached to the letters below are chosen deliberately to show which values are the same, different, or related.

2.5.1) We first take code snippet 9 of the 13 above as an example:

Step  Content  Location              Description
01    C1       HelloWorld.java       C1 is a GBK character
02    U1       javac reads source    U1 is a Unicode character
03    C1       getBytes(), step 1    Java first communicates with the operating system
04    B1,B2    getBytes(), step 2    and then returns the byte array
05    C1       new String(), step 1  Java first communicates with the operating system
06    U1       new String(), step 2  and then returns the character
07    C1       println(String)       "中" is displayed, identical to the original

2.5.2) Next, take code snippet 10 as an example:

Step  Content  Location              Description
01    C1       HelloWorld.java       C1 is a GBK character
02    U1       javac reads source    U1 is a Unicode character
03    C1       getBytes(), step 1    Java first communicates with the operating system
04    B1,B2    getBytes(), step 2    and then returns the byte array
05    C3,C4    new String(), step 1  Java first communicates with the operating system; parsing goes wrong here
06    U5,U6    new String(), step 2  and then returns the characters
07    C3,C4    println(String)       "中" has been split into two halves

Each half is decoded as a separate ISO8859_1 character that cannot be mapped to a character in the console's GBK encoding, so "中" is displayed as "??". In the example above, the two characters "中文" are displayed as "????".
2.5.3) The other cases in the all-Chinese environment are similar and are not discussed further.
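The failure in code snippet 10 (section 2.5.2) can be imitated without a Chinese console. The sketch below uses hypothetical names (SplitCharacterDemo, misdecode, asSeenOnGbkConsole) and simulates the console by encoding to GBK and decoding again. It assumes that the JDK's GBK charset is available and that it has no mapping for the four Latin-1 characters involved, in which case the encoder substitutes "?".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SplitCharacterDemo {
    // Assumes the JDK ships the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // GBK bytes wrongly decoded as ISO-8859-1:
    // every byte turns into a character of its own.
    static String misdecode(String text) {
        return new String(text.getBytes(GBK), StandardCharsets.ISO_8859_1);
    }

    // Simulates printing to a GBK console: characters with no GBK
    // mapping are replaced by the encoder's replacement byte '?'.
    static String asSeenOnGbkConsole(String s) {
        return new String(s.getBytes(GBK), GBK);
    }

    public static void main(String[] args) {
        String wrong = misdecode("\u4e2d\u6587"); // "中文"
        System.out.println(wrong.length());       // 4 characters, not 2
        System.out.println(asSeenOnGbkConsole(wrong));
    }
}
```

The string still carries the original four byte values as characters; the information is lost only when it is encoded for output.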

2.6) Next we look at why classes compiled in the Western DOS window also show these problems when run in the Chinese window, and in particular why the Chinese characters are displayed correctly in some cases.

2.6.1) Take code snippet 9 as an example:

Step  Content   Location              Description
01    C1C2      HelloWorld.java       C1C2 are two ISO8859_1 characters; "中" has been split apart
02    U3U4      javac reads source    U3U4 are two Unicode characters
03    C5C6      getBytes(), step 1    Java communicates with the operating system; parsing goes wrong here
04    B5B6B7B8  getBytes(), step 2    and then returns the byte array
05    C5C6      new String(), step 1  Java first communicates with the operating system
06    U3U4      new String(), step 2  and then returns the characters
07    C5C6      println(String)       still two characters, but no longer the original "two ISO8859_1 characters"

They are now "two GBK characters", so "中" is shown as "??" and "中文" is shown as "????".

2.6.2) Now take code snippet 12 as an example, because it displays the Chinese characters correctly:

Step  Content  Location              Description
01    C1C2     HelloWorld.java       C1C2 are two ISO8859_1 characters; "中" has been split apart
02    U3U4     javac reads source    U3U4 are two Unicode characters
03    C1C2     getBytes(), step 1    Java communicates with the operating system (note: still correct!)
04    B5B6     getBytes(), step 2    and then returns the byte array (this is a critical step!)
05    C12      new String(), step 1  Java communicates with the operating system (an even more critical step: Java now knows to parse B5B6 as one Chinese character!)
06    U7       new String(), step 2  and then returns the character (two really have become one! U7 contains the information of U3U4)
07    C12      println(String)       the original "中"

The character was wrongly split by javac, but the programmer has put it back together. Naturally, the two characters "中文" are displayed correctly!
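The repair performed by code snippet 12 can be sketched as a pair of helpers. This follows the article's simplification of treating the Western window's encoding as ISO8859_1; the names RecoverMisreadSource, misreadByCompiler, and recover are illustrative only, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RecoverMisreadSource {
    // Assumes the JDK ships the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Simulates javac reading GBK source bytes as ISO-8859-1:
    // each byte of the literal becomes one character.
    static String misreadByCompiler(String original) {
        return new String(original.getBytes(GBK), StandardCharsets.ISO_8859_1);
    }

    // Snippet 12: getBytes("ISO8859_1") returns the original bytes
    // unchanged, and decoding them as GBK rebuilds the Chinese text.
    static String recover(String misread) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    public static void main(String[] args) {
        String misread = misreadByCompiler("\u4e2d\u6587"); // "中文"
        System.out.println(misread.length());               // 4
        System.out.println(recover(misread).length());      // 2
    }
}
```

The repair works because ISO8859_1 maps every byte value to exactly one character and back, so the wrong decoding loses no information.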

3. In JDBC, why do the following calls sometimes produce garbled text?
new String(recordset.getBytes(int)[, encode])
recordset.getString(int)
recordset.setBytes(String.getBytes([encode]))
and
recordset.setString(String)

In fact, the problem arises because the authors of the JDBC driver also dealt with encoding. When the driver reads data from the database, it may already have converted it from the default encoding (GB2312) to Unicode. My WebLogic JDBC driver for SQL Server behaves this way: when I read a string, what comes back is not the correct Chinese text, yet annoyingly I can write a Chinese string directly, which is a bit hard to accept!
That is, we have to transcode when reading or writing, although the transcoding is sometimes not obvious because the default encoding is used. What the JDBC driver actually does can only be determined by reading its source code, right?
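As a rough illustration of transcoding on read, the sketch below decodes raw column bytes explicitly and, alternatively, repairs a string that a driver decoded with the wrong encoding. It runs without a database: the byte array stands in for what ResultSet.getBytes(int) might return and the Latin-1 string for what ResultSet.getString(int) might return. The class JdbcTranscodeSketch and both helpers are hypothetical, and which encodings a particular driver uses is an assumption that has to be checked against that driver.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class JdbcTranscodeSketch {
    // Assumes the JDK ships the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Hypothetical helper: decode raw column bytes with the encoding
    // the database really uses instead of relying on the driver.
    static String decodeColumn(byte[] raw, Charset dbCharset) {
        return new String(raw, dbCharset);
    }

    // Hypothetical helper: undo a wrong decoding done by the driver.
    // Only reliable when usedByDriver is lossless, such as ISO-8859-1.
    static String repair(String fromDriver, Charset usedByDriver, Charset actual) {
        return new String(fromDriver.getBytes(usedByDriver), actual);
    }

    public static void main(String[] args) {
        // Stand-in for bytes read from a CP936/GBK column.
        byte[] raw = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(decodeColumn(raw, GBK).length()); // 2

        // Stand-in for a string the driver decoded as ISO-8859-1.
        String fromDriver = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(repair(fromDriver, StandardCharsets.ISO_8859_1, GBK).length()); // 2
    }
}
```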
