In-depth analysis of Chinese problems in Java programming and suggestions for optimal solutions

Last Update:2016-11-30 Source: Internet

Author: User


Excerpt from: http://fafeng.blogbus.com/logs/3062998.html

Http://www.blogbus.com/fafeng-logs/3063006.html


Description: This article is the author's original work; the author's contact address is [email protected]. The Chinese-language problem in Java programming is a perennial topic. After reading many proposed solutions and combining them with my own programming practice, I found that many of the methods discussed in the past neither explain the problem clearly nor solve it, especially in cross-platform cases. This article therefore analyzes the Chinese-language problem in classes run from the console, servlets, JSP pages, and EJB classes, and suggests solutions. Comments are welcome. If you quote this article, please cite the source.
Abstract: This article analyzes how the Java compiler encodes and decodes Java source files and how the JVM handles class files, and uses that process to identify the root cause of Chinese-language problems in Java programming. It then proposes recommended methods for solving them.
1. The source of Chinese-language problems
The earliest operating systems supported only single-byte character encodings, so software was originally written to process single-byte English text. As computing spread, Unicode was proposed to accommodate the world's other languages (including Chinese characters). It uses a double-byte encoding that is compatible with English characters and with the double-byte characters of other languages, so most internationalized software now uses Unicode internally. At run time such software obtains the default encoding of the local platform (usually the operating system) and converts its internal Unicode to that encoding. The Java JDK and JVM work this way. The JDK referred to here is the international version, which is the version most programmers use. Chinese is a double-byte language, and standards such as GB2312, GBK, and GBK2K were defined so that computers could process it. Most operating systems therefore come in a localized Chinese edition that uses GBK or GB2312 to display Chinese characters correctly. For example, Chinese Win2K displays text in GBK by default and also saves files in GBK by default. Note that GBK is an extension of GB2312.
Because Java uses Unicode internally, a running Java program must convert its input and output between Unicode and the encodings supported by the operating system and the browser. This conversion takes place in a series of steps, and if any one of them goes wrong, Chinese characters are displayed as garbled text. This is the familiar Java Chinese-language problem.
Java is also a cross-platform language. The programs we write must run not only on Chinese Windows but also on Chinese Linux and on English-language systems (people often write a Java program on Chinese Win2K and then move it to English Linux). Porting of this kind introduces Chinese-language problems as well.
In addition, some people run programs that contain Chinese characters, and browse the resulting pages, on an English operating system with an English browser such as IE. If these do not support Chinese, problems result.
Finally, almost all browsers pass parameters in UTF-8 by default rather than in a Chinese encoding, so Chinese parameters can arrive garbled.
In short, these are the main sources of Chinese-language problems in Java. We refer to the failures they cause collectively as the Java Chinese problem.
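The core failure described in this section, one side encoding text with one charset and the other side decoding it with a different one, can be reproduced in a few lines. The following is a minimal sketch, not part of the original article: the class and method names are hypothetical, and it assumes a JDK that ships the GBK charset. The Chinese text is written with Unicode escapes so that the sketch itself does not depend on the source file's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // "\u4e2d\u6587" is the two-character word for "Chinese", written with
    // Unicode escapes so this file compiles the same under any source encoding.
    public static final String TEXT = "\u4e2d\u6587";

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    public static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");

        // Same charset on both sides: the text survives unchanged.
        System.out.println(roundTrip(TEXT, gbk, gbk).equals(TEXT)); // true

        // Mismatched charsets: the four GBK bytes are misread as four Latin-1 characters.
        String garbled = roundTrip(TEXT, gbk, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(TEXT));  // false
        System.out.println(garbled.length());      // 4
    }
}
```

Every conversion step discussed in the rest of this article is an instance of this pattern, with the two charsets chosen by different parties (editor, compiler, container, browser, database).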
2. The detailed process of Java encoding conversion
Common Java programs fall into the following categories:
* Classes that run directly on the console (including classes with a graphical interface)
* JSP code (note: a JSP is a variant of a servlet class)
* Servlet classes
* EJB classes
* Other support classes that cannot be run directly
Any of these classes may contain Chinese strings. The first three kinds commonly interact directly with the user for character input and output. For example, a JSP or servlet receives characters sent by the client, which may include Chinese. Whatever their role, all of these Java programs share the following life cycle:
* The programmer chooses an editor on some operating system, writes the source code, and saves it with a .java extension. For example, we might use Notepad on Chinese Win2K to edit a Java source file;
* The programmer compiles the source with javac.exe from the JDK to produce .class files (JSP files are compiled by the container, which invokes the JDK);
* The classes are run directly or deployed to a web container, and they output their results.
So how do the JDK and the JVM encode and decode these files during these stages?
We use the Chinese Win2K operating system as an example to explain how Java classes are encoded and decoded.
Step one: we write a Java source file (any of the five types above) in an editor such as Notepad on Chinese Win2K. By default the file is saved in GBK, the operating system's default encoding (exposed to Java as the file.encoding property), producing a .java file. In other words, before compilation the source file, which contains Chinese text as well as English program code, is stored in the operating system's default file.encoding format. To view the system's file.encoding value, you can use the following code:
public class ShowSystemDefaultEncoding {
    public static void main(String[] args) {
        // file.encoding holds the default character encoding seen by the JVM
        String encoding = System.getProperty("file.encoding");
        System.out.println(encoding);
    }
}
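As a complement to reading the file.encoding property, the JVM's effective default can also be queried through java.nio.charset.Charset. The sketch below is an addition to the article with hypothetical class and method names. Note that the value printed depends on the platform and the JDK version, so no particular output is assumed.

```java
import java.nio.charset.Charset;

public class DefaultCharsetInfo {
    /** Returns the charset the JVM uses when no encoding is given explicitly. */
    public static Charset jvmDefault() {
        return Charset.defaultCharset();
    }

    /** Reports whether this JVM ships a charset with the given name, e.g. "GBK". */
    public static boolean supports(String charsetName) {
        return Charset.isSupported(charsetName);
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + jvmDefault().name());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println("GBK supported:   " + supports("GBK"));
    }
}
```

Checking Charset.isSupported before relying on GBK or GB2312 is a cheap way to detect a stripped-down runtime that lacks the extended charsets.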

Step two: we compile the source file with javac.exe from the JDK. Because the JDK is an international version, if we do not specify the source encoding with the -encoding option, javac.exe first obtains the operating system's default encoding from the file.encoding property (GBK on Chinese Win2K). It then reads the source into memory, converting it from the file.encoding format to Java's internal Unicode format, and compiles the Unicode text into a .class file. The class file is held in memory in Unicode form and then written to disk, which is the .class file we see. The resulting .class file therefore stores its content in Unicode, including the Chinese strings from our source, which have been converted from the file.encoding format to Unicode.
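The decoding that the compiler performs on a source file can be imitated with ordinary file I/O. The sketch below is an illustration under stated assumptions, not the compiler's actual code: SourceDecodingSketch, saveAs, and readAs are hypothetical names, and GBK stands in for the editor's save encoding. It saves a line of source text as GBK bytes and then decodes it twice, once with the right assumption and once with the wrong one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SourceDecodingSketch {
    /** Saves text to a temporary file in the given encoding, as an editor would. */
    public static Path saveAs(String text, Charset fileEncoding) {
        try {
            Path file = Files.createTempFile("Source", ".java");
            file.toFile().deleteOnExit();
            Files.write(file, text.getBytes(fileEncoding));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes the file's bytes into a String, as a compiler must do before parsing. */
    public static String readAs(Path file, Charset assumedEncoding) {
        try {
            return new String(Files.readAllBytes(file), assumedEncoding);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String line = "String s = \"\u4e2d\u6587\";";
        Path file = saveAs(line, Charset.forName("GBK"));

        // Right assumption about the file's encoding: the literal is recovered.
        System.out.println(readAs(file, Charset.forName("GBK")).equals(line)); // true
        // Wrong assumption: the Chinese literal is already corrupted before compilation.
        System.out.println(readAs(file, StandardCharsets.ISO_8859_1).equals(line)); // false
    }
}
```

The point of the sketch is that the damage happens at read time: once the wrong charset has been applied, the corrupted characters are what ends up in the class file.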
For JSP source files this step is different. The web container calls the JSP compiler, which first checks whether the JSP file declares its own encoding. If it does not, the JSP compiler calls the JDK to convert the JSP file into a temporary servlet class using the JVM's default character encoding (that is, the default file.encoding of the operating system on which the web container runs). The servlet is then compiled into a Unicode class file and saved in a temporary folder. For example, on Chinese Win2K the web container converts the JSP file from GBK to Unicode and compiles it into a temporarily stored servlet class in order to respond to user requests.

Step three: run the classes compiled in step two. There are four scenarios:
A. Classes that run directly on the console
B. EJB classes and support classes that cannot be run directly (such as JavaBean classes)
C. JSP code and servlet classes
D. Between Java programs and databases
Let's look at these four scenarios.
A. Classes that run directly on the console
Running such a class requires a JVM, so a JRE must be installed on the operating system. The java launcher starts the JVM, which reads the class file from disk into memory, where it is held in Unicode form, and runs it. If the class needs user input, it decodes the entered text using file.encoding by default and converts it to Unicode in memory (the program can set the encoding of the input stream explicitly). After the program runs, the resulting Unicode strings are handed back to the JVM, and the JRE converts them to the file.encoding format (the program can also set the encoding of the output stream) before passing them to the operating system's display interface.
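The output half of this process can be observed by printing through a PrintStream with an explicit encoding and inspecting the bytes it produces. This is a minimal sketch with hypothetical names (ConsoleOutputSketch, printWith); an in-memory buffer stands in for the console, and a JDK with the GBK charset is assumed.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleOutputSketch {
    /** Prints text through a PrintStream using the named encoding and returns the raw bytes. */
    public static byte[] printWith(String text, String encodingName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, encodingName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown encoding: " + encodingName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters held as Unicode in memory

        System.out.println(printWith(text, "GBK").length);        // 4: two bytes per character
        System.out.println(printWith(text, "UTF-8").length);      // 6: three bytes per character
        System.out.println(printWith(text, "ISO-8859-1").length); // 2: each character replaced by '?'
    }
}
```

The same in-memory string yields three different byte sequences, which is why the encoding of the output stream has to match what the display side expects.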
For classes that run directly on the console, the conversion process is shown in Figure 1.

Every conversion above must use the correct encoding if garbled output is to be avoided.
B. EJB classes and support classes that cannot be run directly (such as JavaBean classes)
EJB classes and support classes that cannot be run directly generally do not exchange input and output with the user. They interact with other classes instead. After step two their content is saved on disk as Unicode class files, and as long as no information is lost when parameters are passed between them and other classes, they run correctly.
For EJB classes and support classes that cannot be run directly, the conversion process is shown in Figure 2:

C. JSP code and servlet classes
After step two, a JSP file has also been converted into a servlet class file. The only difference is that it is stored in the web container's temporary directory rather than in the classes directory like a standard servlet, so in this step we treat it as a servlet too.
When a client requests a servlet, the web container uses its JVM to run it. The JVM first reads the servlet class from disk and loads it into memory, where its code is held in Unicode, and then runs it. If the servlet needs to accept text from the client, such as form input or values passed in the URL, and the program does not set an encoding for the parameters, the web container decodes the incoming values as ISO-8859-1 by default and converts them to Unicode in the JVM's memory. The output the servlet produces is also in Unicode. The container sends this output (HTML markup, user-visible strings, and so on) to the client browser. If the program has specified an output encoding, the output is sent in that encoding; otherwise it is sent as ISO-8859-1 by default.
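The effect of the container's ISO-8859-1 default on incoming parameters can be simulated without a servlet container. The sketch below is an addition to the article: the names are hypothetical, the browser is assumed to send GBK bytes, and the repair step shown (re-encoding as ISO-8859-1 and decoding again with the real charset) is a commonly used workaround rather than the method this article recommends. It works because ISO-8859-1 maps every byte value to exactly one character, so the wrong decoding loses no information.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContainerDecodingSketch {
    /** What the container does by default: treat the request bytes as ISO-8859-1. */
    public static String decodeAsContainerDefault(byte[] requestBytes) {
        return new String(requestBytes, StandardCharsets.ISO_8859_1);
    }

    /** Undoes the wrong decoding by recovering the bytes and decoding them properly. */
    public static String repair(String misdecoded, Charset actualEncoding) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actualEncoding);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String typed = "\u4e2d\u6587";
        byte[] onTheWire = typed.getBytes(gbk); // assumption: the browser sends GBK bytes

        String seenByServlet = decodeAsContainerDefault(onTheWire);
        System.out.println(seenByServlet.equals(typed));              // false
        System.out.println(repair(seenByServlet, gbk).equals(typed)); // true
    }
}
```

Telling the container the correct encoding before the parameters are read avoids the need for any such repair.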
For JSP code and servlet classes, the conversion process is shown in Figure 3:

D. Between Java programs and databases
Almost all JDBC drivers use ISO-8859-1 as the default encoding when passing data between a Java program and a database. When our program stores data containing Chinese, JDBC first converts the program's internal Unicode data to ISO-8859-1 and then passes it to the database, which by default also saves it as ISO-8859-1. This is why Chinese data read back from the database is so often garbled.
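Why a Unicode to ISO-8859-1 conversion damages Chinese text can be shown without a database. The following is a sketch only, with hypothetical names and no JDBC involved: it encodes a string with a given "wire" charset and decodes it again, which is the best case for a store-and-load round trip. ISO-8859-1 has no code for Chinese characters, so Java's encoder substitutes a question mark for each one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyEncodingSketch {
    /** Encodes text with the wire charset and decodes it again, as a store-then-load would. */
    public static String storeAndLoad(String text, Charset wireEncoding) {
        byte[] stored = text.getBytes(wireEncoding);
        return new String(stored, wireEncoding);
    }

    public static void main(String[] args) {
        String value = "\u4e2d\u6587 abc";

        // ISO-8859-1 cannot represent the Chinese characters; each becomes '?'.
        System.out.println(storeAndLoad(value, StandardCharsets.ISO_8859_1)); // ?? abc

        // A Chinese-capable charset preserves the value.
        System.out.println(storeAndLoad(value, Charset.forName("GBK")).equals(value)); // true
    }
}
```

Unlike the wrong-decoding case, this loss cannot be repaired afterwards, because the original bytes were never written.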

The data transfer between a Java program and a database is shown in Figure 4.

3. Principles to keep in mind when analyzing Java Chinese-language problems
First, the analysis above shows that in the life cycle of any Java program, the key encoding conversions are the transcoding that happens when the source is first compiled into a class file and the transcoding that happens when output is finally delivered to the user.
Second, we must know the common encodings that Java supports at compile time:
* ISO-8859-1: 8-bit; aliases include 8859_1, ISO-8859-1, and ISO_8859_1
* Cp1252: the Windows Western European code page, often called the ANSI code page
* UTF-8: a Unicode encoding
* GB2312: aliases include gb2312-80 and gb2312-1980
* GBK: also known as MS936; an extension of GB2312
There are other encodings as well, for Korean, Japanese, Traditional Chinese, and so on. The compatibility relationships among these encodings are as follows:
Unicode and UTF-8 correspond one to one. GB2312 can be regarded as a subset of GBK; that is, GBK extends GB2312. GBK contains 20,902 characters in the range 0x8140-0xFEFE, all of which map one to one to Unicode 2.0.
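These relationships can be checked against the charsets that ship with the JDK. The sketch below uses hypothetical names and assumes the JDK provides the GB2312 and GBK charsets. It uses "\u9f8d", the traditional form of the character for "dragon", as an example of a character that GBK covers and GB2312 does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetCompatibilitySketch {
    /** Reports whether the named charset has a code for every character in text. */
    public static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    /** Reports whether two charsets produce identical bytes for the same text. */
    public static boolean sameBytes(String text, String charsetA, String charsetB) {
        byte[] a = text.getBytes(Charset.forName(charsetA));
        byte[] b = text.getBytes(Charset.forName(charsetB));
        return Arrays.equals(a, b);
    }

    /** Reports whether text survives a UTF-8 encode and decode unchanged. */
    public static boolean survivesUtf8(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).equals(text);
    }

    public static void main(String[] args) {
        String common = "\u4e2d\u6587"; // present in both GB2312 and GBK
        String gbkOnly = "\u9f8d";      // traditional character, outside GB2312

        System.out.println(sameBytes(common, "GB2312", "GBK")); // true: GBK keeps GB2312's codes
        System.out.println(canEncode("GB2312", gbkOnly));       // false
        System.out.println(canEncode("GBK", gbkOnly));          // true
        System.out.println(survivesUtf8(common + gbkOnly));     // true
    }
}
```

A practical consequence is that GBK is the safer choice of the two when the text may contain traditional or rare characters.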
Third, for .java source files stored on the operating system, we can specify the encoding of their content at compile time with the -encoding option. Note: if the source contains Chinese characters and you pass -encoding a charset that does not match the file, the result will obviously be wrong. If you specify GBK or GB2312 with -encoding, a Java source file containing Chinese characters compiles without problems on any system, and the Chinese text is correctly converted to Unicode and stored in the class file.
Finally, we must be aware that almost all web containers use ISO-8859-1 as their internal default character encoding, while almost all browsers pass parameters in UTF-8 by default. So even if our Java source file specifies the correct encoding on entry and exit, its data is still handled as ISO-8859-1 while it runs inside the container.

4. The classification of Chinese-language problems and recommended solutions
With the above principles of how Java processes files in mind, we can propose a set of recommended solutions to the Chinese-character problem.
Our goal is this: Java source files that contain Chinese strings or process Chinese text, edited on a Chinese-language system, should run correctly when the compiled classes are moved to any other operating system, or when the source is compiled on another operating system. They should pass Chinese and English parameters correctly and exchange Chinese and English strings with the database correctly.
Our approach is to fix the encoding explicitly at the points where a Java program's content is transcoded on entry and exit, and wherever the program exchanges input and output with the user.
The specific solutions are as follows:
1. Classes that run directly on the console
If the program may receive input containing Chinese or produce output containing Chinese, we recommend using character streams for input and output. Specifically, use the following character-oriented node streams:
For files: FileReader, FileWriter
Byte-oriented counterparts: FileInputStream, FileOutputStream
For memory (arrays): CharArrayReader, CharArrayWriter
Byte-oriented counterparts: ByteArrayInputStream, ByteArrayOutputStream
For memory (strings): StringReader, StringWriter
For pipes: PipedReader, PipedWriter
Byte-oriented counterparts: PipedInputStream, PipedOutputStream
The input and output should also be wrapped in the following character-oriented processing streams:
BufferedWriter, BufferedReader
Byte-oriented counterparts: BufferedInputStream, BufferedOutputStream
InputStreamReader, OutputStreamWriter
Byte-oriented counterparts: DataInputStream, DataOutputStream
InputStreamReader and OutputStreamWriter convert between a byte stream and a character stream using a specified character encoding, for example:
InputStreamReader in = new InputStreamReader(System.in, "GB2312");
OutputStreamWriter out = new OutputStreamWriter(System.out, "GB2312");
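The input side of this wrapping can be tried without a real console by substituting an in-memory byte stream for System.in. This is a minimal sketch with hypothetical names (EncodedInputSketch, readLine), assuming a JDK with the GBK charset. The bytes are what a GBK console would deliver for one line of typed Chinese text.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class EncodedInputSketch {
    /** Reads the first line from a byte stream, decoding it with the given charset. */
    public static String readLine(InputStream in, Charset charset) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Stand-in for System.in: one line of text as GBK bytes.
        byte[] typed = "\u4e2d\u6587\n".getBytes(gbk);

        String line = readLine(new ByteArrayInputStream(typed), gbk);
        System.out.println(line.equals("\u4e2d\u6587")); // true
    }
}
```

Passing a Charset object rather than a charset name avoids the checked UnsupportedEncodingException; either constructor form is available on InputStreamReader.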
The following complete program, Read.java, applies this approach:
Read.java
import java.io.*;

public class Read {
    public static void main(String[] args) throws IOException {
        String str = "Chinese test, this is the internal hard-coded string" + "test English character";
        String strin = "";
        // Set the input stream to use a Chinese encoding
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, "gb2312"));
        // Set the output stream to use a Chinese encoding
        BufferedWriter stdout = new BufferedWriter(new OutputStreamWriter(System.out, "gb2312"));
        stdout.write("Please enter:");
        stdout.flush();
        strin = stdin.readLine();
        stdout.write("This is a string entered from the User:" + strin);
        stdout.write(str);
        stdout.flush();
    }
}
When compiling the program, specify the source encoding as follows:
javac -encoding gb2312 Read.java
The result of running it is shown in Figure 5:

2. EJB classes and support classes that cannot be run directly (such as JavaBean classes)
Because such a class is called by other classes and does not interact directly with the user, we recommend that it use character streams to handle Chinese strings internally (as in the previous section) and that it be compiled with the -encoding gb2312 option to indicate that the source file is in a Chinese encoding.
3. Servlet classes
For servlets, we recommend the following:
When compiling the servlet source, use -encoding to specify GBK or GB2312. In the part of the code that writes output to the user, call the response object's setContentType("text/html;charset=GBK") (or gb2312) to set the output encoding. When receiving user input, call request.setCharacterEncoding("GB2312"). Then, whatever operating system the servlet class is ported to, the output is displayed correctly as long as the client's browser supports Chinese. The following is a good example:

The result of running it is shown in Figure 6:


4. Between Java programs and databases
To avoid garbled data passing between Java programs and databases, we recommend the following practices:
1. Handle the Java program as described in the sections above.
2. Change the encoding supported by the database to GBK or GB2312.
For example, in MySQL we can add the following to the configuration file my.ini.
In the [mysqld] section:
default-character-set=gbk
and also add:
[client]
default-character-set=gbk
In SQL Server 2000, we can set the database's default language to Simplified Chinese to achieve the same goal.
5. JSP code
Because a JSP is compiled dynamically by the web container at run time, if we do not specify the encoding of the JSP source file, the JSP compiler compiles it using the file.encoding value of the server's operating system. This is the case most prone to porting problems. For example, a JSP file that runs well on Chinese Win2K may fail on English Linux even though the client is the same, because the container obtains a different operating system encoding when it compiles the JSP (file.encoding differs between Chinese Win2K and English Linux, and the English Linux file.encoding does not support Chinese, so the compiled JSP class is wrong). Most of the discussion on the Internet concerns this problem: a JSP file that displays incorrectly after being moved to another platform. Once we understand how Java converts encodings, such problems are much easier to solve. We propose the following:
1. Ensure that the JSP sends its output to the client in a Chinese encoding. That is, always begin by adding the following line to the JSP source:

2. So that the JSP can correctly read incoming parameters, add the following line to the header of the JSP source file:

3. So that the JSP compiler can correctly decode a JSP file containing Chinese characters, declare the file's encoding in the JSP source itself. Specifically, add the following line to the header of the JSP source file:
or
This directive is new in the JSP 2.0 specification.
We recommend this approach for solving Chinese-language problems in JSP files. The following code is a test page that follows these practices:

Figure 7 shows the result of running this program:

5. Summary
The detailed analysis above lays out the conversion process that Java applies to source programs, which gives us a basis for correctly solving Chinese-language problems in Java programming. We have also given recommended solutions to those problems.
6. References
1. De Minghui. Analysis and solution of Chinese character problems in Java programming technology.

Http://www-900.ibm.com/developerWorks/cn/java/java_chinese/index.shtml
2. Zhou. Several analytical principles on the problem of Java Chinese.
Http://www-900.ibm.com/developerWorks/cn/java/l-javachinese/index.shtml
3. About the author.

Abnerchai, senior programmer. Author contact: [email protected].
