Java Chinese problems and optimal solutions


1. The source of Chinese problems

The character encodings supported by the earliest operating systems were single-byte encodings, so all programs initially processed single-byte English text. As computing spread, Unicode was introduced to accommodate the world's other languages, Chinese included. Unicode was originally designed as a two-byte (16-bit) code that covers English characters as well as the double-byte character sets of other languages. Most internationalized software therefore uses Unicode internally: at run time it obtains the default encoding of the local system (usually the operating system) and converts between its internal Unicode and that local format. The Java JDK and JVM work this way. The JDK discussed here is the international edition, which is the version most programmers use.

Chinese text uses a double-byte encoding, and standards such as GB2312, GBK, and GBK2K were developed so that computers could process it. Most operating systems therefore ship in a customized Chinese edition that uses GBK or GB2312 to display Chinese characters correctly. For example, Chinese Win2K displays text in GBK by default and also saves files in GBK by default, so every file saved on Chinese Win2K uses GBK internally unless another encoding is chosen. Note: GBK is an extension of GB2312.

Because Java uses Unicode internally, every input and output operation involves a conversion between Unicode and the encoding supported by the operating system or the browser. This conversion happens in a series of steps, and if any one of them uses the wrong encoding, Chinese characters are displayed as garbled text. This is the familiar Java Chinese problem.
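As a minimal sketch of how one wrong step produces garbled text, the following hypothetical class (MojibakeDemo and its recode method are illustrative names, not from the original article) encodes a Chinese string as GBK bytes and then decodes those bytes with a different charset. It assumes the running JRE provides the GBK charset, and it writes the Chinese characters as Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
class MojibakeDemo {
    // Assumption: the running JRE provides the GBK charset (standard JDK builds do).
    static final Charset GBK = Charset.forName("GBK");

    // The two Chinese characters meaning "Chinese (language)", written as Unicode escapes.
    static final String CHINESE = "\u4e2d\u6587";

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String recode(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // Same charset on both sides: the text survives the round trip.
        System.out.println(CHINESE.equals(recode(CHINESE, GBK, GBK)));

        // Mismatched charsets: each of the four GBK bytes becomes an unrelated Latin-1 character.
        String garbled = recode(CHINESE, GBK, StandardCharsets.ISO_8859_1);
        System.out.println(CHINESE.equals(garbled) + ", length " + garbled.length());
    }
}
```

The second call stands in for any step in the chain (editor, compiler, container, browser) that assumes a different encoding from the one the bytes were actually written in.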

Java is also a cross-platform language: a program we write should run not only on Chinese Windows but also on Chinese Linux and on English and other systems. (It is common to see a Java program written on Chinese Win2K ported to English Linux.) Porting of this kind can also cause Chinese problems.

In addition, some people run programs that contain Chinese characters, or browse Chinese web pages, on an English operating system with English IE or another browser that has no Chinese support. This causes Chinese problems as well.

Almost all browsers pass parameters in UTF-8 by default rather than in a Chinese encoding, so passing Chinese parameters can also produce garbled text.

In short, these are the main sources of Chinese problems in Java. We call any failure of a program to run correctly for the reasons above a Java Chinese problem.

2. The detailed process of Java encoding conversion

Common Java programs fall into the following categories:
* Classes that run directly on the console (including classes with a graphical interface)
* JSP code (note: a JSP is a variant of a servlet class)
* Servlet classes
* EJB classes
* Other support classes that cannot be run directly

Any of these class files may contain Chinese strings. The first three kinds of program usually interact directly with the user to read and write characters. For example, a JSP or servlet receives characters from the client, and these may include Chinese. Whatever these classes do, their life cycle is the same:

* The programmer chooses an editor on some operating system, writes the source code, and saves it with the .java extension; for example, editing a Java source file with Notepad on Chinese Win2K.
* The programmer compiles the source with javac.exe from the JDK to produce .class files (JSP files are compiled by the container, which invokes the JDK).
* The classes are run directly, or deployed to a web container that runs them, and the results are output.

So how do the JDK and the JVM encode and decode these files during these steps?

Here, we use the Chinese Win2K operating system as an example to illustrate how Java classes are encoded and decoded.

Step one: on Chinese Win2K we write a Java source file (any of the five kinds of program above) in an editor such as Notepad. By default the file is saved in GBK, the operating system's default encoding, which Java exposes as the file.encoding system property. In other words, before it is compiled, a Java source file containing Chinese text and English program code is stored in the operating system's default file.encoding format. The following code prints the system's file.encoding value:
public class ShowSystemDefaultEncoding {
    public static void main(String[] args) {
        // file.encoding holds the default encoding the JVM picked up at startup
        String encoding = System.getProperty("file.encoding");
        System.out.println(encoding);
    }
}

Step two: we compile the source with the JDK's javac.exe. Because the JDK is an international edition, if we do not use the -encoding option to declare the encoding of the source file, javac.exe first obtains the operating system's default encoding, that is, the file.encoding property (GBK on Chinese Win2K). It then converts the source from that encoding into Java's internal Unicode form in memory and compiles the Unicode text into a class file. The class file is built in memory and then written to disk as the .class file we see. The strings in the resulting .class file, including the Chinese strings from our source, are therefore stored in a Unicode form (class files store string constants in a modified UTF-8 encoding), having been converted on the assumption that the source was in the file.encoding format.

JSP source files are handled differently in this step. The web container calls the JSP compiler, which first checks whether the JSP file declares its file encoding. If it does not, the JSP compiler invokes the JDK to convert the JSP file into a temporary servlet class using the JVM's default character encoding (that is, the default file.encoding of the operating system on which the web container runs), compiles it into a Unicode class file, and saves it in a temporary folder. For example, on Chinese Win2K the web container converts the JSP file from GBK to Unicode and compiles it into a temporarily stored servlet class that serves the user's requests.

Step three: run the classes compiled in step two. There are four cases:

A. Classes that run directly on the console
B. EJB classes and support classes that cannot be run directly (such as JavaBean classes)
C. JSP code and servlet classes
D. Data passed between Java programs and databases

We look at the four cases in turn.

A. Classes that run directly on the console

Running such a class requires a JVM, so a JRE must be installed on the operating system. The process is as follows. The java command starts the JVM, which reads the class file stored by the operating system into memory, where the class is held in Unicode form, and runs it. If the class needs user input, by default it decodes the characters the user types using the file.encoding format and converts them to Unicode in memory (the program can set the encoding of the input stream explicitly). When the program produces output, the resulting Unicode strings are handed back to the JVM, and the JRE converts them to the file.encoding format (the program can likewise set the encoding of the output stream) and passes them to the operating system's display interface.

For a class that runs directly on the console, the conversion process is shown in Figure 1:

Figure 1 (image not available)

Every conversion step above must use the correct encoding if the final output is not to be garbled.

B. EJB classes and support classes that cannot be run directly (such as JavaBean classes)

EJB classes and support classes that cannot be run directly usually do not exchange input and output with the user; they exchange data with other classes instead. After being compiled in step two, they exist as class files whose content is stored in Unicode form. As long as no data is lost when parameters are passed between them and other classes, they run correctly.
The conversion process for these classes is shown in Figure 2:

Figure 2 (image not available)
C. JSP code and servlet classes

After step two, a JSP file has also been converted into a servlet class file. Unlike a standard servlet it does not live in the classes directory but in the web container's temporary directory, so in this step we treat it as a servlet too.

When a client requests a servlet, the web container uses its JVM to run it. The JVM first loads the servlet's class file into memory, where the servlet code is held in Unicode form, and then runs it. If the servlet needs to accept characters from the client, such as values entered in a form or passed in the URL, and the program has not set the encoding to use for incoming parameters, the web container decodes the incoming values as ISO-8859-1 by default and converts them to Unicode in the container's memory. The output the servlet generates (HTML markup, user-visible strings, and so on) is in Unicode form, and the container sends it to the client browser. If the program specifies an output encoding, the output is sent to the browser in that encoding; otherwise it is sent as ISO-8859-1 by default. The conversion process for JSP code and servlet classes is shown in Figure 3:

Figure 3 (image not available)
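To illustrate the default ISO-8859-1 decoding described for servlets, here is a minimal sketch using plain strings rather than a real container. ParameterRecoveryDemo and its method names are hypothetical, and the GB2312 charset is assumed to be available in the JRE. Because ISO-8859-1 maps every byte to exactly one character, a value that was wrongly decoded this way can be turned back into its original bytes and decoded again. This is a widely used workaround, not the method this article recommends, which is to declare the request encoding up front.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
class ParameterRecoveryDemo {
    /** What a container does when it decodes raw request bytes with an ISO-8859-1 default. */
    static String decodeAsContainerDefault(byte[] rawRequestBytes) {
        return new String(rawRequestBytes, StandardCharsets.ISO_8859_1);
    }

    /** Undoes the wrong decoding: get the original bytes back, then decode them correctly. */
    static String recover(String misdecoded, Charset actualCharset) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actualCharset);
    }

    public static void main(String[] args) {
        // Assumption: the GB2312 charset is available in this JRE.
        Charset gb2312 = Charset.forName("GB2312");
        String sent = "\u4e2d\u6587"; // two Chinese characters, as Unicode escapes
        byte[] onTheWire = sent.getBytes(gb2312);

        String seenByServlet = decodeAsContainerDefault(onTheWire);
        System.out.println(sent.equals(seenByServlet));                   // garbled, so not equal
        System.out.println(sent.equals(recover(seenByServlet, gb2312)));  // recovered
    }
}
```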
D. Data passed between Java programs and databases

With nearly all JDBC drivers, the default encoding for data passed between a Java program and the database is ISO-8859-1. When our program stores Chinese data, JDBC first converts the Unicode data in the program to ISO-8859-1 and then passes it to the database, which also saves it as ISO-8859-1 by default. This is why Chinese data read back from a database is so often garbled.
The data transfer between a Java program and a database is shown in Figure 4:

Figure 4 (image not available)
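The effect of converting Unicode text to ISO-8859-1 can be sketched without a database. The hypothetical class below (Latin1LossDemo is an illustrative name; no JDBC driver is involved) encodes a string to ISO-8859-1 bytes and decodes it again, which is a simplified stand-in for a store-and-load cycle through a Latin-1 column. Characters that ISO-8859-1 cannot represent are replaced during encoding, so the original text cannot be recovered afterwards.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article; no real database is used.
class Latin1LossDemo {
    /** True if every character of the text can be represented in ISO-8859-1. */
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    /** Simulates storing text as ISO-8859-1 bytes and loading it back. */
    static String storeAndLoad(String text) {
        // String.getBytes(Charset) substitutes the charset's replacement byte
        // for every character it cannot encode.
        byte[] stored = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String english = "hello";
        String chinese = "\u4e2d\u6587"; // two Chinese characters, as Unicode escapes

        System.out.println(fitsInLatin1(english) + " -> " + storeAndLoad(english));
        System.out.println(fitsInLatin1(chinese) + " -> " + storeAndLoad(chinese));
    }
}
```

Checking fitsInLatin1 before writing is one way to detect, rather than silently suffer, this kind of loss.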

3. Principles that must be understood when analyzing common Java Chinese problems

First, the detailed analysis above shows that in the life of any Java program the key encoding conversions are the transcoding performed when the source is first compiled into a class file and the transcoding performed when output is finally sent to the user.
Second, we must know the common encodings that Java supports at compile time:
* ISO-8859-1: 8-bit; also known as 8859_1, ISO8859_1, ISO_8859_1, and similar names
* Cp1252: the Western (American English) Windows code page, commonly called the ANSI code page
* UTF-8: an encoding of Unicode
* GB2312: also known as gb2312-80, gb2312-1980, and similar names
* GBK: equivalent to MS936; an extension of GB2312
There are also encodings for other languages, such as Korean, Japanese, and traditional Chinese. The compatibility relationships among these encodings are as follows:
Unicode and UTF-8 correspond one to one: every Unicode character has exactly one UTF-8 encoding. GB2312 can be regarded as a subset of GBK; that is, GBK extends GB2312. GBK contains 20,902 Chinese characters in the code range 0x8140 to 0xFEFE, and every one of them maps to a character in Unicode 2.0.
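The subset relationship between GB2312 and GBK can be checked with a short sketch. GbkSubsetDemo and its helpers are hypothetical names, and the example assumes the JRE provides both the GB2312 and GBK charsets. It compares the bytes produced for a common string under both encodings and then asks each encoder whether it can represent a traditional-form character (U+9F8D, "dragon"), which falls outside the simplified-character repertoire of GB2312.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original article.
class GbkSubsetDemo {
    /** Encodes text with the named charset (assumed to be available in this JRE). */
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    /** True if the named charset has a code for the given character. */
    static boolean canEncode(char c, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    /** Formats bytes as upper-case hexadecimal, two digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String common = "\u4e2d\u6587";   // two common simplified characters
        char traditional = '\u9f8d';      // traditional-form "dragon"

        // Characters shared by both standards get identical byte sequences.
        System.out.println(hex(encode(common, "GB2312")));
        System.out.println(hex(encode(common, "GBK")));

        // GBK covers characters that GB2312 does not.
        System.out.println(canEncode(traditional, "GB2312"));
        System.out.println(canEncode(traditional, "GBK"));
    }
}
```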

Third, when compiling a .java source file stored on the operating system, we can declare the encoding of its content with the -encoding option. Note that if the source contains Chinese characters and -encoding names some other, incompatible encoding, the result will obviously be wrong. If -encoding declares the source as GBK or GB2312, the Chinese characters are converted to Unicode and stored in the class file correctly no matter which system the source is compiled on.

Finally, we must be clear that almost all web containers use ISO-8859-1 as their internal default character encoding, while almost all browsers pass parameters as UTF-8 by default. So even if our Java source file declares the correct encoding at its entry and exit points, the data is still handled as ISO-8859-1 while it is inside the container.
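The mismatch between a browser that sends UTF-8 and a container that decodes ISO-8859-1 can be sketched with the standard URLEncoder and URLDecoder classes. QueryParamDemo and its wrapper methods are hypothetical; they only rewrap the checked exception so the calls are easy to use. No browser or container is involved, so this is an approximation of the behavior described above, not a reproduction of any particular server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the original article.
class QueryParamDemo {
    /** Percent-encodes a parameter value using the named charset. */
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Decodes a percent-encoded parameter value using the named charset. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String value = "\u4e2d\u6587"; // two Chinese characters, as Unicode escapes

        // What a UTF-8 browser would put in the URL: three bytes per character.
        String onTheWire = encode(value, "UTF-8");
        System.out.println(onTheWire);

        // Decoding with the matching charset restores the value.
        System.out.println(value.equals(decode(onTheWire, "UTF-8")));

        // Decoding as ISO-8859-1 yields one wrong character per byte.
        String misread = decode(onTheWire, "ISO-8859-1");
        System.out.println(value.equals(misread) + ", length " + misread.length());
    }
}
```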

4. Classification of Chinese problems and the optimal solutions

Now that we understand how Java processes these files, we can propose a set of optimal solutions to the Chinese problem.
Our goal is this: a Java source program that contains Chinese strings or processes Chinese, edited on a Chinese system, should run correctly after its compiled classes are moved to any other operating system, or after being compiled on another operating system; it should pass Chinese and English parameters correctly; and it should exchange Chinese and English strings with a database correctly.
Our approach is to constrain the encoding at the points where a Java program is transcoded on entry and exit, and at the points where it exchanges input and output with the user, so that each conversion is correct.

The specific solutions are as follows:

1. For classes that run directly on the console
If the program needs to receive input from the user that may contain Chinese, or produce output that contains Chinese, it should handle input and output with character streams. Specifically, use the following character-oriented node streams:
For files: FileReader, FileWriter
Their byte-oriented counterparts are: FileInputStream, FileOutputStream
For memory (arrays): CharArrayReader, CharArrayWriter
Their byte-oriented counterparts are: ByteArrayInputStream, ByteArrayOutputStream
For memory (strings): StringReader, StringWriter
For pipes: PipedReader, PipedWriter
Their byte-oriented counterparts are: PipedInputStream, PipedOutputStream
Input and output should also be wrapped in the following character-oriented processing streams:
BufferedWriter, BufferedReader
Their byte-oriented counterparts are: BufferedInputStream, BufferedOutputStream
InputStreamReader, OutputStreamWriter
Their byte-oriented counterparts are: DataInputStream, DataOutputStream
InputStreamReader and OutputStreamWriter convert between a byte stream and a character stream according to a specified character set, for example:
InputStreamReader in = new InputStreamReader(System.in, "GB2312");
OutputStreamWriter out = new OutputStreamWriter(System.out, "GB2312");
The following sample Java program shows the required handling:

Read.java
import java.io.*;
public class Read {
    public static void main(String[] args) throws IOException {
        // An internal hard-coded string that contains Chinese characters
        String str = "\n中文测试, this is an internal hard-coded string" + "\ntest English character";
        String strin = "";
        // Decode console input as Chinese (GB2312)
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, "gb2312"));
        // Encode console output as Chinese (GB2312)
        BufferedWriter stdout = new BufferedWriter(new OutputStreamWriter(System.out, "gb2312"));
        stdout.write("Please enter:");
        stdout.flush();
        strin = stdin.readLine();
        stdout.write("This is the string entered by the user: " + strin);
        stdout.write(str);
        stdout.flush();
    }
}
When compiling the program, we use the following command:
javac -encoding gb2312 Read.java
The result of running it is shown in Figure 5:

Figure 5 (image not available)
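The same idea applies to files. The FileReader and FileWriter constructors that take only a file name use the platform default encoding, so a portable program can instead wrap FileInputStream and FileOutputStream in InputStreamReader and OutputStreamWriter with an explicit charset. The sketch below is a hypothetical helper (FileCharsetDemo and writeThenRead are illustrative names) that writes text to a temporary file in one charset and reads it back in another; it assumes the GBK charset is available in the JRE.

```java
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of the original article.
class FileCharsetDemo {
    /** Writes text to a temporary file with one charset and reads it back with another. */
    static String writeThenRead(String text, Charset writeAs, Charset readAs) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                // Character stream over a byte stream, with the encoding stated explicitly.
                try (Writer w = new BufferedWriter(
                        new OutputStreamWriter(new FileOutputStream(file.toFile()), writeAs))) {
                    w.write(text);
                }
                StringBuilder sb = new StringBuilder();
                try (Reader r = new BufferedReader(
                        new InputStreamReader(new FileInputStream(file.toFile()), readAs))) {
                    int c;
                    while ((c = r.read()) != -1) {
                        sb.append((char) c);
                    }
                }
                return sb.toString();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: available in this JRE
        String text = "\u4e2d\u6587";         // two Chinese characters, as Unicode escapes

        System.out.println(text.equals(writeThenRead(text, gbk, gbk)));
        System.out.println(text.equals(writeThenRead(text, gbk, StandardCharsets.UTF_8)));
    }
}
```

Naming the charset on both the writing and the reading side removes the dependency on file.encoding, which is what makes the file portable between a Chinese and an English system.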

2. For EJB classes and support classes that cannot be run directly (such as JavaBean classes)

Because these classes are called by other classes and do not interact with the user directly, we recommend that they handle Chinese strings internally with character streams (as in the previous section) and that they be compiled with the -encoding gb2312 option to declare that the source file is in a Chinese encoding.

3. For servlet classes

For servlets we recommend the following approach:

When compiling the servlet's source, declare the encoding as GBK or GB2312 with -encoding. In the code that writes output to the user, set the output encoding with response.setContentType("text/html;charset=GBK") (or gb2312), and before reading user input call request.setCharacterEncoding("GB2312"). Then, whatever operating system the servlet is ported to, the output is displayed correctly as long as the client's browser supports Chinese. The following is a correct example:

HelloWorld.java
package hello;
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
public class HelloWorld extends HttpServlet
{
    public void init() throws ServletException {}
    public void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException
    {
        request.setCharacterEncoding("GB2312"); // Set input encoding format
        response.setContentType("text/html;charset=gb2312"); // Set output encoding format
        PrintWriter out = response.getWriter(); // Recommended: use PrintWriter for output
        out.println("<html><body>");
        out.println("Hello World! This is created by Servlet! 测试中文!");
        out.println("</body></html>");
    }
    public void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException
    {
        request.setCharacterEncoding("GB2312"); // Set input encoding format
        response.setContentType("text/html;charset=gb2312"); // Set output encoding format
        String name = request.getParameter("name");
        String id = request.getParameter("id");
        if (name == null) name = "";
        if (id == null) id = "";
        PrintWriter out = response.getWriter(); // Recommended: use PrintWriter for output
        out.println("<html><body>");
        out.println("The Chinese string you passed in is: " + name);
        out.println("</body></html>");
    }
    public void destroy() {}
}
Compile this program with javac -encoding gb2312 HelloWorld.java.
A page for testing this servlet looks like this:
<%@ page contentType="text/html; charset=gb2312" %>
<% request.setCharacterEncoding("GB2312"); %>
<script language="JavaScript">
function Submit() {
    // Pass a Chinese string value to the servlet in the URL
    document.base.action = "./helloworld?name=中文";
    document.base.method = "POST";
    document.base.submit();
}
</script>

<body bgcolor="#FFFFFF" text="#000000" topmargin="5">
<form name="base" method="POST" target="_self">
<input name="id" type="text" value="">
<a href="javascript:Submit()">Pass to Servlet</a>
</form></body>
The result of running it is shown in Figure 6:

Figure 6 (image not available)

4. Between Java programs and databases

To avoid garbled data when passing data between a Java program and a database, we recommend the following:
1. Handle encoding in the Java program according to the methods described above.
2. Change the database's default encoding to GBK or GB2312.

For example, in MySQL we can add the following statements to the configuration file my.ini.
In the [mysqld] section, add:
default-character-set=gbk
and also add:
[client]
default-character-set=gbk
Note that newer MySQL servers (5.5 and later) no longer accept default-character-set in the [mysqld] section; the server option there is character-set-server.
In SQL Server 2000, we can achieve the same goal by setting the database's default language to Simplified Chinese.

5. For JSP code

Because a JSP is compiled dynamically by the web container at run time, if we do not declare the encoding of the JSP source file, the JSP compiler uses the file.encoding value of the server's operating system to compile it. This is the most likely cause of problems when porting: a JSP file that runs perfectly on Chinese Win2K may fail on English Linux even though the client is the same, because the container obtains a different operating system encoding when compiling the JSP. (The file.encoding value on Chinese Win2K differs from the value on English Linux, and the English Linux value does not support Chinese, so the compiled JSP class is wrong.) Most of the discussion of this topic online concerns exactly this problem, a JSP file that no longer displays correctly after being moved to another platform. Once we understand how Java converts encodings, it is much easier to solve. We propose the following solution:

1. To ensure that the JSP's output to the client is always in a Chinese encoding, first add the following line to the JSP source:

<%@ page contentType="text/html; charset=gb2312" %>
2. So that the JSP reads incoming parameters correctly, add the following line at the top of the JSP source file:
<% request.setCharacterEncoding("GB2312"); %>
3. So that the JSP compiler decodes a JSP file containing Chinese characters correctly, declare the encoding of the source file in the file itself by adding one of the following lines:
<%@ page pageEncoding="GB2312" %> or <%@ page pageEncoding="GBK" %>
The pageEncoding attribute is defined in the JSP 2.0 specification.
We recommend this method for solving Chinese problems in JSP files. The following test program shows the correct approach:

testchinese.jsp
<%@ page pageEncoding="GB2312" %>
<%@ page contentType="text/html; charset=gb2312" %>
<% request.setCharacterEncoding("GB2312"); %>
<%
String action = request.getParameter("action");
String name = "";
String str = "";
if (action != null && action.equals("SENT"))
{
    name = request.getParameter("name");
    str = request.getParameter("str");
}
%>
<title></title>
<script language="JavaScript">
function Submit()
{
    // Pass a Chinese string to this page in the URL
    document.base.action = "?action=SENT&str=传入的中文";
    document.base.method = "POST";
    document.base.submit();
}
</script>
<body bgcolor="#FFFFFF" text="#000000" topmargin="5">
<form name="base" method="POST" target="_self">
<input type="text" name="name" value="">
<a href="javascript:Submit()">Submit</a>
</form>
<%
if (action != null && action.equals("SENT"))
{
    out.println("<br>The characters you entered are: " + name);
    out.println("<br>The characters you passed in the URL are: " + str);
}
%>
</body>
Figure 7 is a schematic of the results of this program's operation:

Figure 7 (image not available)

5. Summary

The detailed analysis above sets out the conversions Java performs when processing a source program, which gives us a foundation for solving Chinese problems in Java programming correctly. We have also given what we consider the best solutions to the Java Chinese problem.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
