Analysis and solution of Chinese character problem in Java programming technology, file operation


In Java programming we often run into problems processing and displaying Chinese characters. A screen
full of garbled text is certainly not the result we want, so how do we get Chinese characters displayed
correctly? Java's default encoding is Unicode, whereas the files and databases we commonly use are based
on GB2312, BIG5 and similar encodings. How do we choose an encoding properly and handle Chinese
characters correctly? Starting from encoding basics and using Java programming examples, this article
analyzes these two questions and proposes solutions.

The Java programming language is now widely used on the Internet. Sun took support for non-English
characters into account from the time it developed the language. The Java Runtime Environment (JRE)
that Sun publishes comes in an English edition and an international edition, and only the international
edition supports non-English characters. In practice, however, support for Chinese is not as complete
as the JavaSoft specification claims: there is more than one Chinese character encoding, and different
operating systems support Chinese differently, so many problems related to Chinese character encoding
trouble us during application development. Many answers to these problems exist, but they are piecemeal
and do not satisfy the urgent need for a real solution, and systematic study of the Java Chinese
problem is rare. This article starts from encoding basics and analyzes the Java Chinese problem in the
hope of helping to solve it.
Encoding basics
As we know, an English character is generally represented in one byte, and the most common encoding is
ASCII. One byte can distinguish at most 256 characters, while there are thousands of Chinese characters,
so Chinese characters are represented in two bytes. To keep them distinguishable from English
characters, the highest bit of each byte is set to 1. Two bytes can in principle represent up to 64K
characters, and fewer once those high bits are fixed. The encodings we meet most often are GB2312, BIG5
and Unicode. Readers interested in the details of a specific encoding can consult the relevant
references; here I only touch briefly on GB2312 and Unicode, the two most closely related to our topic.
GB2312, the PRC national standard Chinese character code for information interchange, is a simplified
Chinese encoding issued by the national standards bureau of the People's Republic of China. It is used
in mainland China and Singapore and is known as the GB (national standard) code. Each character takes
two bytes: the first (high) byte is the area number plus 20H, and the second (low) byte is the position
number plus 20H. In the form actually stored in files the high bit of each byte is set as well, which
amounts to adding A0H rather than 20H.
Unicode is a multi-byte, equal-length encoding intended to solve the problem of representing the
characters of many languages. For English characters it achieves equal-length compatibility by putting
a zero byte in front: the ASCII code of "A" is 0x41, and its Unicode form is 0x00, 0x41. Special-purpose
tools can convert between the various encodings.
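The byte layouts described above can be inspected directly. The following is a minimal sketch, not part of the original article: the class name EncodingBasics and its helper methods are made up for illustration, and it assumes the running JVM ships the GB2312 charset. Chinese characters are written as Unicode escapes so the sketch compiles identically whatever the source file encoding is.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingBasics {

    /** Formats bytes as space-separated upper-case hex, for example "00 41". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Returns true if every byte has its highest bit set to 1. */
    static boolean allHighBitsSet(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumption: the GB2312 charset is installed in this runtime.
        Charset gb2312 = Charset.forName("GB2312");
        String zhong = "\u4e2d"; // the character 中

        // One byte in ASCII: 41
        System.out.println(hex("A".getBytes(StandardCharsets.US_ASCII)));
        // Zero byte in front in 16-bit Unicode: 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE)));
        // Two bytes in GB2312, both with the high bit set: D6 D0
        byte[] gb = zhong.getBytes(gb2312);
        System.out.println(hex(gb) + " highBitsSet=" + allHighBitsSet(gb));
    }
}
```

The stored bytes D6 D0 are what lets a decoder tell a Chinese character apart from two ASCII characters, whose high bits are always 0.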
A preliminary understanding of Java Chinese problems
When we develop applications in Java, we inevitably have to handle Chinese. Java's default encoding is
Unicode, while the databases and files we normally use are based on GB2312, so we often run into
situations like these: a JSP-based website shows garbled text in the browser, a file shows garbled text
when opened, or database content modified through Java can no longer supply correct information when
it is used elsewhere.
    String sEnglish = "Apple";
    String sChinese = "苹果";
    String s = "苹果 apple ";
The length of sEnglish is 5, the length of sChinese is 4, and the default length of s is 14. For
sEnglish, all of Java's classes support it well and it is certain to display correctly. For sChinese
and s the situation is different: JavaSoft states that Java's basic classes take multilingual
characters into account (Unicode by default), but the operating system's default encoding is usually
not Unicode; on Chinese systems it is a GB encoding. To get from Java source code to a correct result
on screen, text passes through the chain "Java source code -> Java bytecode -> virtual machine ->
operating system -> display device". The encoding of Chinese characters must be handled correctly at
every step of this chain for the final display to be correct.
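When reasoning about string lengths like the ones above, it helps to separate the number of characters a Java String holds from the number of bytes it occupies in a given encoding. The sketch below is illustrative only: LengthDemo and byteLength are hypothetical names, the two-character sample word is an assumption, and the GBK charset is assumed to be present in the runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class LengthDemo {

    /** Number of bytes the text occupies once encoded in the given charset. */
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: the GBK charset is installed in this runtime.
        Charset gbk = Charset.forName("GBK");
        String word = "\u82f9\u679c"; // two Chinese characters, written as escapes

        // String.length() counts UTF-16 code units, not bytes.
        System.out.println("chars       = " + word.length());
        // The byte count depends entirely on the encoding chosen.
        System.out.println("GBK bytes   = " + byteLength(word, gbk));
        System.out.println("UTF-8 bytes = " + byteLength(word, StandardCharsets.UTF_8));
    }
}
```

A length that looks wrong for a Chinese string is therefore often a byte count under some encoding rather than a character count.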

"Java source code -> Java bytecode": the standard Java compiler javac reads source files using the
system default character set, for example GBK on Chinese Windows and, typically, ISO-8859-1 on Linux.
As a result, Chinese characters in a source file compiled on Linux come out wrong in the compiled
classes. The solution is to pass the encoding parameter at compile time, which makes compilation
independent of the platform. Usage:
    javac -encoding GBK
"Java bytecode -> virtual machine -> operating system": the JRE comes in an English edition and an
international edition, and only the international edition supports non-English characters. The Java
Development Kit (JDK) certainly supports multilingual characters, but not every computer user has the
JDK installed. To support Java better, many operating systems and applications embed the international
edition of the JRE, which makes multilingual support convenient.
"Operating system -> display device": the operating system must support Chinese characters and be able
to display them. An English operating system cannot display Chinese unless it is paired with special
application software.
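The compile-time step in the chain above goes wrong for the same reason any reader goes wrong: the bytes of the source file are decoded with a charset other than the one they were saved in. The sketch below simulates that with plain strings rather than invoking the compiler. SourceDecodingDemo and decode are hypothetical names, and the GBK charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class SourceDecodingDemo {

    /** Decodes raw file bytes the way a tool does when told to assume this charset. */
    static String decode(byte[] fileBytes, Charset assumed) {
        return new String(fileBytes, assumed);
    }

    public static void main(String[] args) {
        // Assumption: the GBK charset is installed in this runtime.
        Charset gbk = Charset.forName("GBK");

        // Bytes of a two-character Chinese literal as a GBK editor would save them.
        byte[] saved = "\u4e2d\u6587".getBytes(gbk);

        String right = decode(saved, gbk);
        String wrong = decode(saved, StandardCharsets.ISO_8859_1);

        System.out.println(right.length() + " characters when read as GBK");
        System.out.println(wrong.length() + " characters when read as ISO-8859-1");
    }
}
```

Read as ISO-8859-1, every byte becomes its own character, so the literal baked into the class file is no longer the text the author typed.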
Another issue is converting Chinese characters to the correct encoding inside a Java program. For
example, when writing a Chinese string to a web page, whether you use
    out.println(string); // string contains Chinese
or a JSP expression, the string must be converted from Unicode to GBK, either manually or
automatically. In JSP 1.0 you can declare the output character set of the page so that the internal
code is converted automatically.

However, some JSP versions (for example, JSP 0.92) do not support declaring the output character set,
so the output has to be converted manually. There are many ways to do this; the most common is:
    String s1 = request.getParameter("keyword");
    String s2 = new String(s1.getBytes("ISO-8859-1"), "GBK");
The getBytes method turns the string back into a byte array using the ISO-8859-1 encoding, which
recovers the original bytes, and "GBK" is the target encoding used to decode them. The same applies
when the Chinese string s1 is read from a database that treats its data as ISO-8859-1: after this
conversion, s2 displays correctly in operating systems and applications that support the GBK character
set.
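The conversion above works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost when bytes are wrongly decoded with it. The sketch below simulates the faulty decoding without a servlet container. Latin1Repair and repair are hypothetical names, and the GBK charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Latin1Repair {

    /** Re-decodes a string that was wrongly decoded as ISO-8859-1. */
    static String repair(String misdecoded, Charset actual) {
        // Step 1: recover the original bytes (lossless for ISO-8859-1).
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode them with the charset they were really written in.
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        // Assumption: the GBK charset is installed in this runtime.
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e2d\u6587";

        // Simulate a container that decodes GBK request bytes as ISO-8859-1.
        String s1 = new String(original.getBytes(gbk), StandardCharsets.ISO_8859_1);
        String s2 = repair(s1, gbk);

        System.out.println("repaired equals original: " + s2.equals(original));
    }
}
```

The trick only applies when the wrong decoding really was ISO-8859-1. Applying it to a string that was decoded correctly damages the text instead of repairing it.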
Surface-level analysis and handling of Java Chinese problems
Background
Development environment
    JDK 1.1.5
    Visual Cafe 2.0
    JPadPro
Server side
    NT IIS
    Sybase System
    jConnect (JDBC)
Client
    IE 5.0
    PWin98
The .class files are stored on the server and the applet is run by the client's browser. The applet
itself only loads the main program classes, such as the Frame class. The interface includes TextField,
TextArea, List, Choice and so on.
I. Reading Chinese
After a SELECT statement executed through JDBC reads (Chinese) data from the server, the data is added
to a TextArea (TA) with the append method and is not displayed correctly. When it is added to a List,
however, most Chinese characters display correctly.

Converting the data to a byte array using the "ISO-8859-1" encoding, and then converting those bytes
to a String using the system default character encoding, makes it display correctly in both the TA and
the List.
The code fragment is as follows:
    dbstr2 = results.getString(1);
    // After reading the result from the DB server, convert it to a String.
    dbbyte1 = dbstr2.getBytes("ISO-8859-1");
    dbstr1 = new String(dbbyte1);
If "GBK" or "GB2312" is specified directly when converting the string, instead of relying on the
system default encoding, data is fetched from the database without problems in both case A and case B
(the two cases are described under II below).
II. Writing Chinese to the database
The handling is the reverse of reading: the SQL statement is converted to a byte array using the
system default encoding, those bytes are converted to a String using the "ISO-8859-1" encoding, and
the result is sent for execution. The Chinese text is then written to the database correctly.
The code fragment is as follows:
    sqlstmt = tf_input.getText();
    // Before sending the statement to the DB server, convert it.
    dbbyte1 = sqlstmt.getBytes();
    sqlstmt = new String(dbbyte1, "ISO-8859-1");
    _stmt = _con.createStatement();
    _stmt.executeUpdate(sqlstmt);
    ......
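The reading and writing fragments are mirror images of each other, which is easier to see when both directions sit side by side. The sketch below leaves JDBC out and names the data charset explicitly instead of relying on the system default. DbRoundTrip, toDriverForm and fromDriverForm are hypothetical names, and the GBK charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only; no database is involved.
final class DbRoundTrip {

    /** Writing direction: text -> bytes in the data charset -> ISO-8859-1 string. */
    static String toDriverForm(String text, Charset dataCharset) {
        return new String(text.getBytes(dataCharset), StandardCharsets.ISO_8859_1);
    }

    /** Reading direction: ISO-8859-1 string -> original bytes -> text. */
    static String fromDriverForm(String stored, Charset dataCharset) {
        return new String(stored.getBytes(StandardCharsets.ISO_8859_1), dataCharset);
    }

    public static void main(String[] args) {
        // Assumption: the GBK charset is installed in this runtime.
        Charset gbk = Charset.forName("GBK");
        String text = "\u4e2d\u6587";

        String wire = toDriverForm(text, gbk);     // one character per GBK byte
        String back = fromDriverForm(wire, gbk);   // original text again

        System.out.println("wire length = " + wire.length());
        System.out.println("round trip ok = " + back.equals(text));
    }
}
```

Passing the charset as a parameter keeps the result independent of the client machine's default encoding.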
Problem: if the client machine has a CLASSPATH pointing to the JDK's CLASSES.ZIP (case A), the code
above executes correctly. If the client has only a browser, with no JDK and no CLASSPATH (case B), the
Chinese characters are not converted correctly.
Our analysis:
1. Testing shows that in case A the system default encoding is GBK or GB2312 when the program runs. In
case B, the following error message appears in the browser's Java console when the program starts:
    Can't find resource for sun.awt.windows.awtLocalization_zh_CN
and the system default encoding is "8859-1".
2. If "GBK" or "GB2312" is specified directly when converting strings instead of using the system
default encoding, the program still works in case A, while in case B it fails with:
    UnsupportedEncodingException
3. On the client, the JDK's CLASSES.ZIP was unpacked into another directory, and CLASSPATH was set to
contain only that directory. The .class files in that directory were then deleted step by step while
the test program was run alongside. It finally turned out that of the more than 1000 class files only
one is essential:
    sun.io.CharToByteDoubleByte.class
Copying that file to the server next to the other classes and importing it at the start of the program
still does not make the program work in case B.
4. In case A, if sun.io.CharToByteDoubleByte.class is removed from the CLASSPATH, the program runs with
a default encoding of "8859-1"; otherwise the default is "GBK" or "GB2312".
With JDK 1.2 or later, the problem encountered in case B is well resolved. The test steps are the same
as above, and interested readers can try them.
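Whether a runtime can convert a given encoding at all, which is the difference between case A and case B, can be probed before any conversion is attempted. The sketch below uses java.nio.charset.Charset, which arrived in JDKs later than the ones tested above, so it illustrates the idea rather than the author's procedure. CharsetProbe and its methods are hypothetical names.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
final class CharsetProbe {

    /** Returns true if this runtime can convert to and from the named charset. */
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for names that are null or not legal charset names.
            return false;
        }
    }

    /** Encodes text with the preferred charset, or with the fallback if it is missing. */
    static byte[] encodeWithFallback(String text, String preferred, String fallback) {
        String chosen = isAvailable(preferred) ? preferred : fallback;
        return text.getBytes(Charset.forName(chosen));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"GBK", "GB2312", "Big5", "ISO-8859-1"}) {
            System.out.println(name + " supported: " + isAvailable(name));
        }
        byte[] bytes = encodeWithFallback("A", "GBK", "ISO-8859-1");
        System.out.println("encoded length: " + bytes.length);
    }
}
```

Probing first turns a runtime failure in the middle of a conversion into a decision the program can make up front.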
Root cause analysis and solution of Java Chinese problems
Under Simplified Chinese MS Windows with JDK 1.3, System.getProperties() returns some basic properties
of the Java runtime environment. The class PoorChinese helps us obtain them.
Source code of class PoorChinese:
    public class PoorChinese {
        public static void main(String[] args) {
            // Print every system property of the running JVM.
            System.getProperties().list(System.out);
        }
    }
After running java PoorChinese, we find that the system property file.encoding is GBK, user.language
is zh and user.region is CN. The values of these properties determine that the system default encoding
is GBK.
On the system above, the following code converts a GB2312 file into a Big5 file and helps us
understand how encodings are converted in Java:
    import java.io.*;
    import java.util.*;

    public class Gb2big5 {
        static int iCharNum = 0;

        public static void main(String[] args) {
            System.out.println("Input GB2312 file, output Big5 file.");
            if (args.length != 2) {
                System.err.println("Usage: jview Gb2big5 gbfile big5file");
                System.exit(1);
            }
            String inputString = readInput(args[0]);
            writeOutput(inputString, args[1]);
            System.out.println("Number of characters in file: " + iCharNum);
        }

        // Encode the Unicode string as Big5 and write it to the output file.
        static void writeOutput(String str, String strOutFile) {
            try {
                FileOutputStream fos = new FileOutputStream(strOutFile);
                Writer out = new OutputStreamWriter(fos, "Big5");
                out.write(str);
                out.close();
            }
            catch (IOException e) {
                e.printStackTrace();
            }
        }

        // Decode the GB2312 input file into a Unicode string, counting characters.
        static String readInput(String strInFile) {
            StringBuffer buffer = new StringBuffer();
            try {
                FileInputStream fis = new FileInputStream(strInFile);
                InputStreamReader isr = new InputStreamReader(fis, "GB2312");
                Reader in = new BufferedReader(isr);
                int ch;
                while ((ch = in.read()) > -1) {
                    iCharNum += 1;
                    buffer.append((char) ch);
                }
                in.close();
                return buffer.toString();
            }
            catch (IOException e) {
                e.printStackTrace();
                return null;
            }
        }
    }
The encoding conversion proceeds as follows:
           ByteToCharGB2312              CharToByteBig5
    GB2312 ------------------> Unicode ------------------> Big5
Run java Gb2big5 gb.txt big5.txt. If gb.txt contains the Chinese for "Today is Wednesday", the
characters in the resulting big5.txt display correctly. If gb.txt contains the Chinese for "Happy
Valentine's Day" (情人节快乐), the characters 节 and 乐 in the resulting big5.txt both become the symbol
"?" (0x3F). Evidently the two basic classes sun.io.ByteToCharGB2312 and sun.io.CharToByteBig5 do not
handle this case well.
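A character comes out as "?" when the target charset has no code for it, and this can be checked before the file is written. The sketch below uses java.nio.charset.CharsetEncoder from later JDKs rather than the sun.io classes named above. UnmappableCheck and unmappable are hypothetical names, the sample phrase is an assumption, and the Big5 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, for illustration only.
final class UnmappableCheck {

    /** Returns the characters of text that the target charset cannot encode. */
    static String unmappable(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder missing = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        // Assumption: the Big5 charset is installed in this runtime.
        Charset big5 = Charset.forName("Big5");
        // Assumed sample phrase in simplified Chinese, written as escapes.
        String text = "\u60c5\u4eba\u8282\u5feb\u4e50";

        String missing = unmappable(text, big5);
        System.out.println(missing.length() + " character(s) have no Big5 mapping");

        // String.getBytes silently substitutes the charset's replacement byte.
        byte[] encoded = text.getBytes(big5);
        System.out.println("encoded length: " + encoded.length);
    }
}
```

Running such a check first lets a converter report or substitute the missing characters instead of silently writing 0x3F.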
As the example above shows, Java's basic classes can have problems too. Because the
internationalization work was not done in China, these basic classes were not rigorously tested before
release, so support for Chinese characters is not as complete as JavaSoft claims. Not long ago a
technically minded friend wrote to me that he had finally found the root of the Java Servlet Chinese
problem. For two weeks he had been troubled by it, because every string containing Chinese characters
had to be force-converted to get the correct result (this seems to be the only solution generally
known). Eventually he did not want to go on like that, since such chores really should not be the job
of a senior programmer, so he dug out the source code of the Servlet decoding routines and analyzed
it, suspecting that the problem lay in the decoding. After four hours of struggle he found the root of
the problem. His suspicion was right: the Servlet decoding code does not consider double-byte
characters at all and treats each %XX directly as one character. (So even JavaSoft makes such an
elementary mistake!)
If you are interested in this problem or are suffering from the same annoyance, you can follow his
steps and modify servlet.jar:
Locate static private String parseName in the HttpUtils source code. Before it returns, copy sb (a
StringBuffer) into a byte array bs, and then return new String(bs, "GB2312"). After this change you
need to do the decoding yourself:
    Hashtable form = HttpUtils.parseQueryString(request.getQueryString());
or
    form = HttpUtils.parsePostData(...);
Do not forget to put the recompiled class back into servlet.jar.
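The difference between treating each %XX as a character and treating it as a byte of a multi-byte encoding can be shown without patching servlet.jar. The sketch below is a stand-in, not the friend's fix: it uses java.net.URLDecoder.decode(String, String) from later JDKs. QueryDecodeDemo and decode are hypothetical names, and the GB2312 charset is assumed to be available.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper class, for illustration only.
final class QueryDecodeDemo {

    /** Decodes a URL-encoded value, interpreting the %XX bytes in the named charset. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters as a GB2312 browser would send them.
        String encoded = "%D6%D0%CE%C4";

        // Each %XX becomes its own character: the behavior criticized above.
        String perByte = decode(encoded, "ISO-8859-1");
        // Pairs of bytes are combined into double-byte characters.
        String doubleByte = decode(encoded, "GB2312");

        System.out.println("per-byte length    = " + perByte.length());
        System.out.println("double-byte length = " + doubleByte.length());
    }
}
```

Decoding with the right charset at this point removes the need to convert every parameter by hand afterwards.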
Summary of Java Chinese issues
The Java programming language grew up in the networked world, which requires it to support
multilingual characters well. Java adapted to the needs of network computing, and that laid a solid
foundation for its rapid growth on the network. Java's creator (JavaSoft) did take support for
multilingual characters into account, but the current solution has many flaws, and we need to apply
some compensating measures ourselves. Meanwhile, the international standards body is working to unify
all human writing in a single encoding; one such scheme is ISO 10646, which represents a character in
four bytes. Until such a scheme is adopted, we can only hope that JavaSoft tests its products
rigorously and makes life more convenient for users.
Attached is a function for fixing garbled Chinese read from a database or the network. The input
parameter is the problematic string, and the return value is the corrected string.
    // Requires: import java.io.UnsupportedEncodingException;
    String parseChinese(String in)
    {
        String s = null;
        byte temp[];
        if (in == null)
        {
            System.out.println("Warn: Chinese null found!");
            return new String("");
        }
        try
        {
            // Recover the original bytes, then decode them with the system default encoding.
            temp = in.getBytes("ISO-8859-1");
            s = new String(temp);
        }
        catch (UnsupportedEncodingException e)
        {
            System.out.println(e.toString());
        }
        return s;
    }

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
