Analysis and solution of the Chinese character problem in Java programming technology (Repost)



De Minghui
Freelance writer
November 8, 2000
Contents:
    • Encoding basics
    • A first look at Java Chinese problems
    • Surface analysis and handling of Java Chinese problems
    • Root cause analysis and solution of Java Chinese problems
    • The root of the Java Servlet Chinese problem
    • Modifying servlet.jar
    • A function for handling garbled Chinese
In Java programming we often run into problems processing and displaying Chinese characters. A screen full of garbled text is certainly not what we want to see, so how can Chinese characters be displayed correctly? Java's default encoding is Unicode, whereas the files and databases that Chinese users usually work with are encoded in GB2312 or BIG5. How should we choose an encoding and handle Chinese character encoding correctly? Starting from encoding basics and using Java programming examples, this article analyzes these two questions and proposes solutions.

The Java programming language is now widely used across the Internet, and Sun has taken support for non-English characters into account ever since it began developing Java. The Java Runtime Environment (JRE) that Sun publishes comes in an English edition and an international edition, and only the international edition supports non-English characters. In practice, however, Java's support for Chinese characters is not as complete as the JavaSoft specifications claim, because Chinese has more than one character set and different operating systems support Chinese characters differently. As a result, many encoding problems trouble us during application development. Many answers to these questions exist, but they are piecemeal and do not satisfy developers who urgently need a solution, and there has been little systematic study of the Java Chinese problem. Starting from encoding basics, this article analyzes the Java Chinese problem in the hope of helping to solve it.
Encoding basics
As we know, English characters are generally represented in one byte, and the most common encoding is ASCII. A single byte can distinguish at most 256 characters, while there are thousands of Chinese characters, so Chinese characters are encoded in two bytes. To tell them apart from English characters, the highest bit of each byte is set to 1. Two bytes can take 64K distinct values in total, of which 128 × 128 = 16,384 remain once both high bits are fixed at 1. The encodings we meet most often are GB2312, BIG5, and Unicode; readers interested in the details of each can consult the relevant references. Here I only touch on GB2312 and Unicode, which concern us most. GB2312, the national standard Chinese character code for information interchange, is a national standard of the People's Republic of China for Simplified Chinese, issued by China's national standards administration and used in mainland China and Singapore; it is commonly called the GB code. Each character takes two bytes: the first (high) byte is the area code plus 32 (20H) and the second (low) byte is the position code plus 32 (20H). In files and in memory the high bit of each byte is also set, as described above, so the bytes actually stored are the area and position codes plus A0H. Unicode is a fixed-width multi-byte code developed by the Unicode Consortium to cover the characters of many countries. In its two-byte form it stays compatible with ASCII by placing a zero byte in front of each English character: the ASCII code of "A" is 0x41, and its Unicode code is 0x00, 0x41. With suitable tools the various encodings can be converted into one another.
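The byte layouts described above can be checked directly. The following is a minimal sketch rather than part of the original article: the class name EncodingBytesSketch and the helper hexBytes are hypothetical, the Chinese character U+4E2D is written as a Unicode escape, and the sketch assumes the runtime provides the GB2312 charset.

```java
import java.nio.charset.Charset;

public class EncodingBytesSketch {

    // Hypothetical helper: encode the text in the named charset and
    // return the resulting bytes as space-separated hex pairs.
    public static String hexBytes(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexBytes("A", "US-ASCII"));     // 41
        System.out.println(hexBytes("A", "UTF-16BE"));     // 00 41
        System.out.println(hexBytes("\u4e2d", "GB2312"));  // D6 D0
    }
}
```

Both bytes of the GB2312 result are 0x80 or above, which is the "highest bit set to 1" property that separates Chinese characters from ASCII.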
A first look at Java Chinese problems
When we develop applications in Java, we inevitably have to handle Chinese. Java's default encoding is Unicode, while the databases and files we normally use are based on GB2312, so we often run into situations like these: a JSP-based site shows garbled text in the browser; a file read by a program comes out garbled; database content modified by a Java program can no longer be used correctly by other applications.
String sEnglish = "Apple";
String sChinese = "苹果";
String s = "Apple apples";
The length of sEnglish is 5, the length of sChinese is 4, and the default length of s is 14. Every Java class supports sEnglish well, and it is certain to display correctly. For sChinese and s the situation is different: although JavaSoft states that Java's base classes support multinational characters (the default encoding being Unicode), the operating system's default encoding is not Unicode but the GB code. To get a correct result from Java source code, the text must pass through the chain "Java source code -> Java bytecode -> virtual machine -> operating system -> display device". The encoding of Chinese characters must be handled correctly at every step for the final display to be correct.
"Java source code -> Java bytecode": the standard compiler javac reads source files in the system default character set, for example GBK on Chinese Windows and often ISO-8859-1 on Linux. Classes compiled on Linux from source files that contain Chinese characters therefore come out wrong. The solution is to pass the encoding parameter at compile time, which makes the result independent of the platform. The usage is
javac -encoding GBK
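Another way to take the compile step out of the picture is to write non-ASCII characters as Unicode escapes, which javac reads the same way under any source encoding. The sketch below is an illustration only: the class name EscapedLiteralSketch and the helper byteCount are hypothetical, the two-character Chinese word for "apple" is written as \u82f9\u679c, and the GBK charset is assumed to be available. It also shows why a Chinese string can have two different "lengths": String.length() counts Unicode characters, while the GBK byte count is twice that.

```java
import java.nio.charset.Charset;

public class EscapedLiteralSketch {

    // Two Chinese characters written as Unicode escapes, so the
    // compiled class does not depend on the javac -encoding setting.
    public static final String CHINESE = "\u82f9\u679c";

    // Number of bytes the text occupies in the named charset.
    public static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        System.out.println(CHINESE.length());           // characters: 2
        System.out.println(byteCount(CHINESE, "GBK"));  // bytes in GBK: 4
        System.out.println(byteCount("Apple", "GBK"));  // ASCII stays at 1 byte each: 5
    }
}
```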
"Java bytecode -> virtual machine -> operating system": the JRE comes in English and international editions, and only the international edition supports non-English characters. The Java Development Kit (JDK) does support multinational characters, but not every computer user has the JDK installed. To support Java better, many operating systems and applications embed the international edition of the JRE, which makes it convenient for them to support multinational characters.
"Operating system -> display device": the operating system itself must support Chinese characters and be able to display them. An English operating system without special supporting software will not be able to display Chinese.
Another problem is converting Chinese characters to the correct encoding inside the program. For example, when you output a Chinese string to a Web page, whether you use
out.println(string); // string is a Chinese-language string
or
<%=string%>
the string must be converted from Unicode to GBK, either manually or automatically. In JSP 1.0 you can declare the output character set so that the conversion happens automatically. The usage is
<%@ page contentType="text/html;charset=gb2312" %>
However, some JSP versions (for example JSP 0.92) do not support declaring the output character set, so the output must be converted manually. There are many ways to do this; the most common is
String s1 = request.getParameter("keyword");
String s2 = new String(s1.getBytes("iso-8859-1"), "GBK");
getBytes encodes the string as ISO-8859-1, which recovers the original bytes as a byte array, and "GBK" is the encoding then used to decode those bytes. Suppose the Chinese string s1 was read from an ISO-8859-1-encoded source such as a database; after the conversion above, the Chinese string s2 displays correctly in any operating system and application that supports the GBK character set.
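The round trip behind this conversion can be reproduced without a servlet container. The following is a minimal sketch under stated assumptions: the class MojibakeRepairSketch and its methods garble and repair are hypothetical names, garble only simulates a layer that wrongly decodes GBK bytes as ISO-8859-1, and the GBK charset is assumed to be available. It works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost by the wrong decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepairSketch {

    // Simulates the problem: bytes in the real charset are wrongly
    // decoded as ISO-8859-1, one character per byte.
    public static String garble(String text, String actualCharsetName) {
        byte[] raw = text.getBytes(Charset.forName(actualCharsetName));
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // The repair described in the text: recover the raw bytes via
    // ISO-8859-1, then decode them with the charset they were written in.
    public static String repair(String garbled, String actualCharsetName) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(actualCharsetName));
    }

    public static void main(String[] args) {
        String original = "\u82f9\u679c";            // two Chinese characters
        String garbled = garble(original, "GBK");    // four meaningless characters
        String repaired = repair(garbled, "GBK");
        System.out.println(garbled.length());           // 4
        System.out.println(repaired.equals(original));  // true
    }
}
```

The repair only works when the wrong decoding really was ISO-8859-1; a lossy wrong decoding (one that replaces unknown bytes) cannot be undone this way.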
Surface analysis and handling of Java Chinese problems
Background
Development environment:
    • JDK 1.1.5
    • VCafe 2.0
    • JPadPro
Server side:
    • NT IIS
    • Sybase System
    • jConnect (JDBC)
Client:
    • IE 5.0
    • PWin98

The .class files are stored on the server and the applet is run by the client's browser; the applet only launches the main program, such as a Frame class. The interface includes TextField, TextArea, List, Choice and similar components.
I. Reading Chinese from the database
When JDBC executes a SELECT statement to read Chinese data from the server and the data is added to a TextArea (TA) with the append method, it is not displayed correctly. When added to a List, however, most Chinese characters are displayed correctly.
If the data is first converted to a byte array with the "iso-8859-1" encoding and then converted back to a String with the system default character encoding, it is displayed correctly in both the TA and the List.
The code fragment is as follows:
dbstr2 = results.getString(1);
// After reading the result from the DB server, convert it to a String.
dbbyte1 = dbstr2.getBytes("iso-8859-1");
dbstr1 = new String(dbbyte1);
If the string conversion names "GBK" or "GB2312" directly instead of using the system default encoding, reading data from the database works in both case A and case B (the two cases are described under "Problem" below).
II. Writing Chinese to the database
The processing is the reverse of reading Chinese: first convert the SQL statement to a byte array with the system default encoding, then convert the bytes to a String with the "iso-8859-1" encoding, and finally send it for execution. The Chinese information is then written to the database correctly.
The code fragment is as follows:
sqlstmt = tf_input.getText();
// Before sending the statement to the DB server, convert it.
dbbyte1 = sqlstmt.getBytes();
sqlstmt = new String(dbbyte1, "iso-8859-1");
_stmt = _con.createStatement();
_stmt.executeUpdate(sqlstmt);
......
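Why the reverse conversion helps can be seen by comparing bytes. This is a sketch only, and it rests on an assumption that the source does not confirm: that the driver transmits each character of the statement as a single ISO-8859-1 byte. The class WriteWrapSketch, its methods, and the table name fruit are hypothetical, and the charset is named explicitly (GBK) instead of relying on the system default as the fragment above does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WriteWrapSketch {

    // Encode the SQL text in the database's charset, then present each
    // byte as one ISO-8859-1 character, mirroring the fragment above.
    public static String wrap(String sql, String dbCharsetName) {
        byte[] raw = sql.getBytes(Charset.forName(dbCharsetName));
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // True if a driver that sends one ISO-8859-1 byte per character
    // would transmit exactly the database-charset bytes.
    public static boolean sendsExpectedBytes(String sql, String dbCharsetName) {
        byte[] expected = sql.getBytes(Charset.forName(dbCharsetName));
        byte[] sent = wrap(sql, dbCharsetName).getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(expected, sent);
    }

    public static void main(String[] args) {
        // Hypothetical statement; the value is two Chinese characters.
        String sql = "INSERT INTO fruit VALUES ('\u82f9\u679c')";
        System.out.println(sendsExpectedBytes(sql, "GBK"));  // true
    }
}
```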
Problem: if the CLASSPATH on the client machine points to the JDK's classes.zip (call this case A), the code above runs correctly. If the client has only a browser, with no JDK and no CLASSPATH (call this case B), the Chinese characters cannot be converted correctly.
Our analysis:
1. Testing shows that in case A the system default encoding is GBK or GB2312 when the program runs. In case B, the following error message appears in the browser's Java console when the program starts:
Can't find resource for sun.awt.windows.awtLocalization_zh_CN
and the system default encoding is "8859-1".
2. If the string conversion names "GBK" or "GB2312" directly instead of using the system default encoding, the program still works in case A, while in case B the system reports the error:
UnsupportedEncodingException.
3. On the client, we unpacked the JDK's classes.zip into another directory and pointed CLASSPATH at that directory only. We then deleted the .class files from the directory step by step while running the test program, and eventually found that of the more than 1,000 class files only one is essential:
sun.io.CharToByteDoubleByte.class.
Copying this file to the server alongside the other classes and importing it at the start of the program still does not make the program work in case B.
4. In case A, if sun.io.CharToByteDoubleByte.class is removed from the CLASSPATH, the default encoding measured at run time is "8859-1"; otherwise it is "GBK" or "GB2312".
With JDK 1.2 or later the problem seen in case B is resolved. Interested readers can try the test steps above for themselves.
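On later runtimes, the situation in case B (a named encoding that the runtime does not ship) can be detected up front instead of being discovered through an UnsupportedEncodingException. The following is a minimal sketch, not what the article's authors used: it relies on java.nio.charset.Charset, which was added in Java 1.4 and so is newer than the JDK versions tested above, and the class name CharsetSupportSketch and the method firstSupported are hypothetical.

```java
import java.nio.charset.Charset;

public class CharsetSupportSketch {

    // Returns the first charset name the running JVM supports,
    // or null if none of the candidates is available.
    public static String firstSupported(String... candidates) {
        for (String name : candidates) {
            if (Charset.isSupported(name)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Prefer a Chinese charset, fall back to ISO-8859-1.
        System.out.println(firstSupported("GBK", "GB2312", "ISO-8859-1"));
    }
}
```

A caller can then choose a fallback or report a clear error before any text is converted.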
Root cause analysis and solution of Java Chinese problems
Under Simplified Chinese MS Windows + JDK 1.3, System.getProperties() returns the basic properties of the Java runtime environment, and the class PoorChinese prints them for us.
Source code of class PoorChinese:
public class PoorChinese {
    public static void main(String[] args) {
        System.getProperties().list(System.out);
    }
}
After running java PoorChinese we find that the system property file.encoding is GBK, user.language is en, and user.region is CN. The values of these system properties determine that the system default encoding is GBK.
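To look at only the properties mentioned here, a narrower probe can be used. This is a sketch with hypothetical names (DefaultEncodingSketch, describe); it also prints Charset.defaultCharset(), an API from Java 5 that is newer than the JDK 1.3 used in the text. A property that the running JVM does not set, which may be the case for user.region on newer runtimes, prints as null.

```java
import java.nio.charset.Charset;

public class DefaultEncodingSketch {

    // Collects the locale and encoding settings discussed in the text.
    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", user.language=" + System.getProperty("user.language")
                + ", user.region=" + System.getProperty("user.region")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The output depends on the operating system and JVM settings.
        System.out.println(describe());
    }
}
```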
On the same system, the following code converts a GB2312 file into a Big5 file, which helps us understand how encoding conversion works in Java:

import java.io.*;

public class Gb2big5 {

    // Number of characters read from the input file.
    static int iCharNum = 0;

    public static void main(String[] args) {
        System.out.println("Input GB2312 file, output Big5 file.");
        if (args.length != 2) {
            System.err.println("Usage: jview Gb2big5 gbfile big5file");
            System.exit(1);
        }
        String inputString = readInput(args[0]);
        if (inputString == null) {
            System.exit(1); // reading failed; the stack trace was already printed
        }
        writeOutput(inputString, args[1]);
        System.out.println("Number of Characters in file: " + iCharNum + ".");
    }

    // Encode the string as Big5 and write it to the output file.
    static void writeOutput(String str, String strOutFile) {
        try {
            FileOutputStream fos = new FileOutputStream(strOutFile);
            Writer out = new OutputStreamWriter(fos, "Big5");
            out.write(str);
            out.close();
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Read the input file, decoding its bytes from GB2312 into Unicode characters.
    static String readInput(String strInFile) {
        StringBuffer buffer = new StringBuffer();
        try {
            FileInputStream fis = new FileInputStream(strInFile);
            InputStreamReader isr = new InputStreamReader(fis, "GB2312");
            Reader in = new BufferedReader(isr);
            int ch;
            while ((ch = in.read()) > -1) {
                iCharNum += 1;
                buffer.append((char) ch);
            }
            in.close();
            return buffer.toString();
        }
        catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
}

The encoding conversion proceeds as follows:
         ByteToCharGB2312              CharToByteBig5
GB2312 ------------------> Unicode ------------------> Big5
Run java Gb2big5 gb.txt big5.txt. If gb.txt contains "今天星期三" ("Today is Wednesday"), the characters in the resulting big5.txt are displayed correctly. If gb.txt contains "情人节快乐" ("Happy Valentine's Day"), the characters in big5.txt that correspond to "节" and "乐" are both the symbol "?" (0x3F). These two are simplified forms that have no code point of their own in Big5, so it is clear that the two basic classes sun.io.ByteToCharGB2312 and sun.io.CharToByteBig5 are not well prepared for such input.
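Characters that will turn into "?" can be found before the file is written. The sketch below is not part of the original program: it uses java.nio.charset.CharsetEncoder (Java 1.4 and later) rather than the sun.io classes named above, the class UnmappableSketch and the method unmappable are hypothetical, the two sample sentences are written as Unicode escapes, and the runtime is assumed to provide the Big5 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class UnmappableSketch {

    // Returns the characters of text that the target charset cannot encode.
    // Checks one char at a time, so it is meant for BMP text only.
    public static String unmappable(String text, String targetCharsetName) {
        CharsetEncoder encoder = Charset.forName(targetCharsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        String wednesday = "\u4eca\u5929\u661f\u671f\u4e09";
        String valentine = "\u60c5\u4eba\u8282\u5feb\u4e50";
        System.out.println(unmappable(wednesday, "Big5").length());
        // Print the code points rather than the characters, so the
        // result does not depend on the console encoding.
        for (char c : unmappable(valentine, "Big5").toCharArray()) {
            System.out.println("U+" + Integer.toHexString(c).toUpperCase());
        }
    }
}
```

A converter could use such a check to substitute the traditional form, or at least to warn, instead of silently writing 0x3F.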
As the example above shows, Java's basic classes can have problems of their own. Because the internationalization work was not done in China, these basic classes were released without rigorous testing there, and their support for Chinese characters is not as complete as JavaSoft claims. Not long ago a technically minded friend wrote to tell me that he had finally found the root of the Java Servlet Chinese problem. For two weeks he had been plagued by it, because every string containing Chinese characters had to be explicitly converted to get the correct result (this seemed to be the only solution anyone knew). Eventually he was no longer willing to go on like this, since such work really should not fall to a senior programmer, so he dug out the source code of the Servlet decoding routines and analyzed it, suspecting that the problem lay in the decoding. After four hours of struggle he found the root of the problem. His suspicion was correct: the Servlet's decoding code does not consider double-byte characters at all and treats each %XX directly as a single character. (So even JavaSoft makes such elementary mistakes!)
If you are interested in this issue or are running into the same problem, you can follow his steps to modify servlet.jar.
Locate static private String parseName in the source of HttpUtils. Before it returns, copy sb (the StringBuffer) into a byte array bs[], then return new String(bs, "GB2312"). After making this change you need to do the decoding yourself:
Hashtable form = HttpUtils.parseQueryString(request.getQueryString());
or
form = HttpUtils.parsePostData(...);
Do not forget to put the class back into servlet.jar after compiling it.
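The idea behind the fix, collecting the %XX values as bytes and decoding them together, can be shown in isolation. The following is a sketch and not the HttpUtils source: the class PercentDecodeSketch and its decode method are hypothetical, the input is assumed to be ASCII with well-formed %XX escapes (malformed hex raises NumberFormatException), and the GB2312 charset is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class PercentDecodeSketch {

    // Decodes a URL-encoded value: every %XX becomes one byte, and the
    // whole byte sequence is decoded at once in the given charset, so a
    // double-byte Chinese character is reassembled from its two bytes.
    public static String decode(String encoded, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                bytes.write(c == '+' ? ' ' : c);  // '+' stands for a space
                i++;
            }
        }
        return new String(bytes.toByteArray(), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // %D6%D0 is the GB2312 byte pair of the single character U+4E2D.
        System.out.println(decode("%D6%D0", "GB2312").length());  // 1
        System.out.println(decode("a+b%41", "GB2312"));           // a bA
    }
}
```

Treating each %XX as its own character, as the faulty decoder did, would yield two characters (U+00D6 and U+00D0) for the same input.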
Summary of Java Chinese problems
The Java programming language grew up in the networked world, which requires Java to support multinational characters well. Java met the needs of network computing, and that laid a solid foundation for its rapid growth on the network. Java's creator (JavaSoft) did take support for multinational characters into account, but the current solution still has many flaws, and we need to apply some compensating measures ourselves. Meanwhile, international standards bodies are working to unify all human scripts in a single encoding; one such scheme is ISO 10646, which represents each character in four bytes. Until such a scheme is widely adopted, we can only hope that JavaSoft will test its products rigorously and bring more convenience to users.
Appendix: a function for handling garbled Chinese fetched from a database or the network. The input argument is the problematic string, and the return value is the repaired string.
// Requires: import java.io.UnsupportedEncodingException;
String parseChinese(String in)
{
    String s = null;
    byte temp[];
    if (in == null)
    {
        System.out.println("Warn: Chinese null founded!");
        return new String("");
    }
    try
    {
        // Recover the raw bytes, then decode them with the system default encoding.
        temp = in.getBytes("iso-8859-1");
        s = new String(temp);
    }
    catch (UnsupportedEncodingException e)
    {
        System.out.println(e.toString());
    }
    return s;
}

Reference materials
    • Java discussion board of BBS Shuimu Tsinghua ("Water Wood Tsinghua"): the Java board of China's largest electronic bulletin board, where Java enthusiasts from many universities discuss Java technology

About the author
    • De Minghui (duanmh@dns.ime.tsinghua.edu.cn) is a student in the Department of Electrical Engineering at Tsinghua University
      • He is currently engaged in research and development of a Java smart-card microprocessor at the Institute of Microelectronics, Tsinghua University
      • He leads the Java discussion group on BBS Shuimu Tsinghua, which provides solutions for many applications of Java technology

