From: http://115.47.70.85/ruanjiangongcheng/2011-04/2859.htm
Ref: http://www.ibm.com/developerworks/cn/java/j-lo-chinesecoding/
Abstract: This article explains why Chinese-character problems arise in Java and how to solve them. It also examines how the length of strings that mix Chinese and English text is computed, and gives an implementation of a solution.
As Java applications become more widespread, problems with handling Chinese text in Java have gradually come to light. As a cross-platform, network-oriented programming language, Java has its own approach to text processing, and that approach brings some problems with it.
1. Causes of Java Chinese problems
Java uses the Unicode character set to represent the text of writing systems from around the world. Unicode is a universal character encoding standard for exchanging and displaying text; its common encoding forms include UTF-8, UTF-16, UCS-2, and UCS-4. Chinese, Japanese, and Korean characters (the CJK character set) mainly occupy the range 4e00-9fff, and each character has a unique code. For example, the Unicode codes for the two characters of "中文" (Chinese) are 0x4e2d and 0x6587. The following statement prints these two characters: System.out.println((char) 0x4e2d + "" + (char) 0x6587);
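The relationship between a Unicode code and its encoded bytes can be checked directly. The sketch below is illustrative only: the class name CodePointDemo and the helper hex are made up for this example, and the expected string is written with Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    // Formats bytes as space-separated upper-case hex, e.g. "E4 B8 AD".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the two characters from their Unicode codes.
        String text = "" + (char) 0x4e2d + (char) 0x6587;
        System.out.println(text.equals("\u4e2d\u6587"));                    // true
        // The same two characters produce different bytes in each encoding form.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));     // E4 B8 AD E6 96 87
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE)));  // 4E 2D 65 87
    }
}
```

The character codes stay the same throughout; only the byte representation changes with the encoding form.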
The character encodings in everyday use for Chinese are double-byte character sets (DBCS), which work very differently from Unicode. Handling Chinese in Java therefore mostly comes down to converting DBCS-encoded byte sequences into correct Unicode strings. All Chinese problems arise because a byte sequence was not converted correctly.
These problems typically show up when data is exchanged between operating systems configured for different languages.
2. Solutions to Java Chinese problems
First, make sure that you are using a recent, stable JDK release; this is a prerequisite for handling Chinese correctly in Java.
2.1 Conversion between other internal codes and Unicode
The key to solving the problem is to convert correctly between the various internal codes and Unicode. Java's String class provides a constructor for this, new String(byte[], encoding), which creates a new string by decoding the given byte array with the specified character encoding.
For example: String abc = new String("Hi... Chinese".getBytes("gb2312"), "gb2312");
Here, "Hi... Chinese".getBytes("gb2312") converts the string to a byte array using the gb2312 encoding, and the constructor then decodes those bytes as gb2312 to produce the String object abc.
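To see why the encoding passed to the constructor matters, the following sketch decodes the same bytes twice, once with the right encoding and once with the wrong one. It is a minimal illustration rather than part of the original article: the class DecodeDemo and the method decode are hypothetical names, and it assumes that the running JRE provides the GB2312 charset, which is not guaranteed on every runtime.

```java
import java.nio.charset.Charset;

public class DecodeDemo {
    // Decodes the given bytes with the named charset.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumption: the GB2312 charset is installed in this JRE.
        String original = "Hi... \u4e2d\u6587";
        byte[] bytes = original.getBytes(Charset.forName("GB2312"));
        // 6 single-byte characters plus 2 double-byte characters.
        System.out.println(bytes.length);                                  // 10
        // Decoding with the encoding that produced the bytes restores the text.
        System.out.println(decode(bytes, "GB2312").equals(original));      // true
        // Decoding with another encoding turns every byte into its own character.
        System.out.println(decode(bytes, "ISO-8859-1").equals(original));  // false
        System.out.println(decode(bytes, "ISO-8859-1").length());          // 10
    }
}
```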
2.2 Making the JDK compile with the encoding you specify
When you compile a program with javac, the compiler reads the source files using the system's default encoding. To compile with gb2312 instead, specify the encoding explicitly: javac -encoding gb2312 XXX.java
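One way to reduce the dependence on the compiler's encoding setting is to keep the source file pure ASCII and write Chinese characters as Unicode escapes. The helper below is a hypothetical sketch (the class UnicodeEscaper and the method escape are not from the original article) that converts a string into that escaped form, one UTF-16 unit at a time.

```java
public class UnicodeEscaper {
    // Replaces every non-ASCII UTF-16 unit with a backslash-u escape of four hex digits.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The argument is itself written with escapes, so this file is pure ASCII.
        System.out.println(escape("\u4e2d\u6587 ABC"));
    }
}
```

Running main prints the escaped form \u4e2d\u6587 ABC, which can be pasted into a string literal and compiles to the same characters whatever encoding javac assumes.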
2.3 Chinese problems in JDBC
JDBC (Java Database Connectivity) is the unified interface through which Java programs access databases. Most JDBC drivers transmit Chinese characters over the network in the local encoding; for example, the double-byte character 0x4175 is sent as the two bytes 0x41 and 0x75. Strings returned by JDBC and strings passed to JDBC must therefore be converted: encoding conversion is needed both when inserting data into the database and when querying it, that is, at every point where data enters or leaves the application. For Chinese data, the database character encoding must be one that preserves the data intact; gb2312, GBK, and UTF-8 are all suitable choices.
For example, to convert to UTF-8 for transmission:
sqlstr1 = new String(sqlstr1.getBytes("gb2312"), "UTF-8");
To convert to gb2312 for display:
sqlstr2 = new String(sqlstr2.getBytes("UTF-8"), "gb2312");
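A common variant of this conversion arises when a driver has decoded the stored bytes with the wrong encoding. The sketch below simulates that case without a database; it is an assumption-laden illustration, not the article's code. The class Recode and the method recode are hypothetical, the "driver" is simulated by decoding gb2312 bytes as ISO-8859-1, and the GB2312 charset is assumed to be available in the JRE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Recode {
    // Re-encodes s with the charset it was wrongly decoded with, then decodes
    // the recovered bytes with the charset the data was actually stored in.
    public static String recode(String s, Charset wrong, Charset actual) {
        return new String(s.getBytes(wrong), actual);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be available
        String original = "\u4e2d\u6587";
        // Simulate a driver that decoded gb2312 bytes as ISO-8859-1.
        String garbled = new String(original.getBytes(gb2312), StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());  // 4
        String repaired = recode(garbled, StandardCharsets.ISO_8859_1, gb2312);
        System.out.println(repaired.equals(original));  // true
    }
}
```

The repair works here because ISO-8859-1 maps every byte value to a character, so re-encoding the garbled string returns exactly the original bytes.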
3. How encoding failures appear in Java
A failed conversion shows up in one of two ways: as "?" or as "□". A "?" indicates a transcoding error; a "□" indicates that transcoding did not succeed. If "?" appears, the problem can be fixed only by locating the step where the conversion went wrong. If "□" appears, the string can be transcoded again from that state until the conversion succeeds.
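The difference between a conversion that destroys data and one that can still be undone can be reproduced in a few lines. The sketch below is one reading of that distinction, under stated assumptions: the class FailureModes and the method canEncode are hypothetical names, and UTF-8 stands in for the Chinese encoding so that only charsets guaranteed by the Java platform are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FailureModes {
    // True if every character of s can be encoded in cs without substitution.
    public static boolean canEncode(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        // Lossy: ISO-8859-1 has no Chinese characters, so each one is encoded as '?'.
        byte[] lossy = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(lossy, StandardCharsets.ISO_8859_1)); // ??
        System.out.println(canEncode(text, StandardCharsets.ISO_8859_1));   // false
        // Recoverable: the text was decoded wrongly, but the bytes survive the round trip.
        String garbled = new String(text.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        String repaired = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(repaired.equals(text)); // true
    }
}
```

Once a character has been replaced by a literal question mark, no further transcoding can bring it back, which is why that case has to be fixed at the step where the loss happened.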
4. String length issues caused by Java Chinese encoding
Besides display problems, handling Chinese in Java also makes it harder to determine the length of strings that contain Chinese characters. In C, a Chinese character occupies two bytes, whereas in a Java program the length of a Chinese character depends on the encoding. The following program demonstrates the problem.
The test procedure is as follows. The test string is "Chinese ABC" and the test platform is "Chinese Win XP SP2.
Public class chinesecharactertest
{
Public static void main (string [] ARGs) throws exception
{
// Code by iso8859-1
String ISO = new string ("Chinese ABC". getbytes ("gb2312"), "ISO8859-1 ");
// Encoded by gb2312
String GB = new string (ISO. getbytes ("ISO8859-1"), "gb2312 ");
// UTF-8 encoding
String utf_8 = new string (ISO. getbytes ("ISO8859-1"), "UTF-8 ");
// Print the encoded string and length respectively below
System. Out. println ("ISO is:" + Iso + ", the length is:" + ISO. Length ());
System. Out. println ("GB is:" + GB + ", the length is:" + GB. Length ());
System. Out. println ("UTF-8 is:" + utf_8 + ", the length is:" + utf_8.length ());
}
}
Compile the program using gb2312 encoding. The running result is:
ISO is :???????? ABC, the length is: 7
GB is: Chinese ABC, the length is: 5
UTF-8 is :???????? ABC, the length is: 7
As the output shows, when the bytes are decoded as ISO8859-1 or UTF-8, each Chinese character counts as two length units, whereas when they are decoded as gb2312, each Chinese character counts as one. Keep this in mind when truncating strings: if a truncation splits a Chinese character in half in an ISO8859-1 or UTF-8 string, the text becomes corrupted and the result can be a fatal error.
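One way to avoid cutting a character in half is to truncate on character boundaries while counting encoded bytes. The sketch below shows that idea under stated assumptions: the class SafeTruncate and the method truncate are hypothetical names, and the approach is intended for encodings such as UTF-8 or gb2312 that do not prepend a byte-order mark to each encoded piece.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SafeTruncate {
    // Returns the longest prefix of s whose encoded form fits in maxBytes,
    // never splitting a character.
    public static String truncate(String s, Charset cs, int maxBytes) {
        int used = 0;
        int i = 0;
        while (i < s.length()) {
            // A code point may occupy one or two Java chars.
            int units = Character.charCount(s.codePointAt(i));
            int size = s.substring(i, i + units).getBytes(cs).length;
            if (used + size > maxBytes) {
                break;
            }
            used += size;
            i += units;
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587ABC";
        // In UTF-8 each of the two Chinese characters takes 3 bytes.
        System.out.println(truncate(s, StandardCharsets.UTF_8, 4).length()); // 1
        System.out.println(truncate(s, StandardCharsets.UTF_8, 7).length()); // 3
    }
}
```

With a 4-byte limit the second Chinese character does not fit, so it is dropped whole rather than split.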
5. Solving the Chinese character length problem
To handle the length of strings containing Chinese characters more reliably, we designed and implemented the following method. The basic idea is to decide from the byte ranges used by the encoding whether a character is a Chinese character, and then count it accordingly.
The program is as follows:
public static int getChineseLength(String name, String encoding)
        throws Exception {
    int len = 0; // length to return: 2 per multi-byte character, 1 per single-byte character
    int j = 0;   // index of the current byte
    // Get the bytes of the string in the specified encoding
    byte[] bName = name.getBytes(encoding);
    // Checking the bound first also handles the empty string
    while (j < bName.length) {
        // High nibble of the lead byte
        int high = bName[j] & 0xf0;
        if (high >= 0xb0) {
            if (high < 0xc0) {
                // Lead byte 0xb0-0xbf: two-byte character
                j += 2;
                len += 2;
            } else if (high == 0xc0 || high == 0xd0) {
                // Lead byte 0xc0-0xdf: two-byte character
                j += 2;
                len += 2;
            } else if (high == 0xe0) {
                // Lead byte 0xe0-0xef: three-byte character
                j += 3;
                len += 2;
            } else if (high == 0xf0) {
                // The low nibble selects a 4-, 5-, or 6-byte sequence
                int low = bName[j] & 0x0f;
                if (low < 8) {
                    // Lead byte 0xf0-0xf7 (the original tested only 0xf0)
                    j += 4;
                } else if (low < 12) {
                    j += 5;
                } else {
                    j += 6;
                }
                len += 2;
            }
        } else {
            // Single-byte character
            j += 1;
            len += 1;
        }
    }
    return len;
}
This method makes it convenient to obtain the length of a string that mixes Chinese and Western text, and it returns the same length for the same string regardless of the encoding used.
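For comparison, the same kind of length can also be computed from the decoded string itself, without inspecting bytes. The sketch below is an alternative technique, not the method described above: the class DisplayWidth and the method width are hypothetical names, and it treats only the range 4e00-9fff mentioned in section 1 as Chinese, so full-width punctuation and characters outside that range count as 1.

```java
public class DisplayWidth {
    // Counts 2 for each character in the range 4e00-9fff and 1 for every other character.
    public static int width(String s) {
        int width = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            width += (c >= 0x4e00 && c <= 0x9fff) ? 2 : 1;
        }
        return width;
    }

    public static void main(String[] args) {
        // Two Chinese characters (2 + 2) followed by three ASCII letters (1 + 1 + 1).
        System.out.println(width("\u4e2d\u6587ABC")); // 7
    }
}
```

Because a Java String is already Unicode, this version needs no encoding parameter and cannot run past the end of a byte array.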
6. Summary
In summary, this article has presented the causes of Java Chinese-character problems and their solutions, explained how Chinese character processing in Java affects string length, and provided a practical solution. In practice, this method has resolved such problems in a variety of projects and allows Java's cross-platform strengths to be used to the full.