A Complete Guide to Solving Chinese Character Problems in Java (Part 1)

Description: This article is the author's original work; the author can be reached at josserchai@yahoo.com. Chinese character handling is a perennial problem in Java programming. After reading a great deal of material on it and drawing on my own programming practice, I found that many of the commonly suggested methods neither explain the problem clearly nor solve it, especially when a program moves across platforms. This article therefore analyzes the Chinese character problems that arise in classes run from the console, in Servlets, in JSP, and in EJB classes, and suggests solutions. Comments and corrections are welcome.

Abstract: This article analyzes in depth how the Java compiler encodes and decodes Java source files and how the JVM encodes and decodes class files. From that analysis it identifies the root causes of Chinese character problems in Java programming and then proposes a recommended approach to solving them.

　　1. The source of Chinese problems

Early operating systems supported only single-byte character encodings, so software was originally written to process single-byte English text. As computing spread, Unicode was introduced to accommodate the world's other languages (including Chinese). Unicode, in the form discussed here, uses a double-byte code that is compatible with English characters and with the double-byte encodings of other languages. Most internationalized software today uses Unicode internally. At run time, such software obtains the default encoding of the local platform (usually the operating system) and converts its internal Unicode data to that encoding. The Java JDK and JVM work this way. The JDK discussed in this article is the international version, which is the one most programmers use.

Chinese is written with a double-byte encoding, and standards such as GB2312, GBK, and GBK2K were developed so that computers can process it. Most operating systems therefore come in a localized Chinese edition that uses GBK or GB2312 to display Chinese characters correctly. For example, Chinese Win2K displays text in GBK by default and also saves files in GBK by default, so every file saved on Chinese Win2K is GBK-encoded unless another encoding is chosen. Note: GBK is an extension of GB2312.

Because Java uses Unicode internally, every input and output operation involves a conversion between Unicode and the encoding supported by the operating system or browser. This conversion takes place in a series of steps, and if any one of them goes wrong, the Chinese characters are displayed as garbled text. This is the familiar Java Chinese problem.

Java is also a cross-platform language. A program we write must run not only on Chinese Windows but also on Chinese Linux, and it is often required to run on English-language systems as well (it is common to see a Java program written on Chinese Win2K ported to an English Linux system). Such porting can also introduce Chinese problems.

In addition, some users run programs that contain Chinese characters, or browse Chinese web pages, on an English operating system with an English browser such as IE. If that environment does not support Chinese, Chinese problems appear there too.

Almost all browsers pass parameters in UTF-8 by default rather than in a Chinese encoding, so Chinese parameters can also be garbled in transit.

In short, these are the main sources of Chinese problems in Java. We refer to any failure of a program to run correctly for the reasons above as the Java Chinese problem.
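The effect of one wrong conversion step can be reproduced in a few lines. The following is a minimal sketch, not taken from the original article: the class name `MojibakeDemo` and the helper `transcode` are hypothetical, the sample text is the two characters 中文 written as Unicode escapes, and the sketch assumes the Java runtime ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: encodes text to bytes with one charset and
    // decodes those bytes back to a String with another.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this runtime
        String text = "\u4e2d\u6587";         // two Chinese characters, as Unicode escapes

        // Same charset on both sides: the text survives.
        System.out.println(transcode(text, gbk, gbk).equals(text));

        // GBK bytes read as ISO-8859-1: four Latin-1 characters instead of two Chinese ones.
        String garbled = transcode(text, gbk, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());
    }
}
```

The second call is the garbling described above: the bytes are unchanged, but the decoder interprets each byte as a separate Latin-1 character.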
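To see why the encoding chosen for a parameter matters, the standard `java.net.URLEncoder` class can percent-encode the same text under two charsets. This is an illustrative sketch only: the class name `ParameterEncodingDemo` and the helper `encode` are hypothetical, and the sketch assumes the GBK charset is available in the runtime.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ParameterEncodingDemo {
    // Hypothetical helper: percent-encodes a parameter value using the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String value = "\u4e2d\u6587"; // two Chinese characters, as Unicode escapes
        System.out.println(encode(value, "UTF-8")); // prints %E4%B8%AD%E6%96%87
        System.out.println(encode(value, "GBK"));   // prints %D6%D0%CE%C4
    }
}
```

The two results share no bytes, so a receiver that assumes the wrong charset cannot recover the original characters by decoding alone.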

　　2. The detailed process of Java encoding conversion

Common Java programs fall into the following categories:
* Classes that run directly on the console (including classes with a graphical interface)
* JSP code classes (note: a JSP is a variant of a Servlet class)
* Servlet classes
* EJB classes
* Other support classes that cannot be run directly

Any of these class files may contain Chinese strings. The first three kinds of program interact directly with the user to read and write characters. For example, a JSP or Servlet receives characters from the client, and those characters may include Chinese. Whatever these Java classes do, their life cycle is the same:

* The programmer chooses an editor on some operating system, writes the source code, and saves it with the .java extension. For example, we use Notepad on Chinese Win2K to edit a Java source file.
* The programmer compiles the source code with javac.exe from the JDK to produce .class files (JSP files are compiled by the container, which invokes the JDK).
* The classes are run directly, or deployed to a web container and run there, and the results are output.
So how do the JDK and the JVM encode, decode, and run these files during this process?

Here, we use the Chinese Win2K operating system as an example to illustrate how Java classes are encoded and decoded.

Step 1: On Chinese Win2K we write a Java source file (any of the five kinds of program above) in an editor such as Notepad. When the file is saved, it is stored by default in GBK, the operating system's default encoding (this default is exposed to Java as the file.encoding property), producing a .java file. In other words, before a Java program is compiled, its source file is saved in the operating system's default file.encoding format, and it contains both Chinese text and English program code. To view the system's file.encoding value, you can use the following code:
public class ShowSystemDefaultEncoding {
    public static void main(String[] args) {
        // file.encoding names the default character encoding the JVM picked up
        String encoding = System.getProperty("file.encoding");
        System.out.println(encoding);
    }
}

Step 2: We compile the Java source file with javac.exe from the JDK. Because the JDK is the international version, if we do not use the -encoding option to specify the encoding of the source file, javac.exe first obtains the operating system's default encoding from the file.encoding property (on Win2K its value is GBK). The JDK then converts the source file from the file.encoding format into Java's internal Unicode format and holds it in memory. Next, javac compiles the converted Unicode text into a class file, which at this point is Unicode-encoded and held in memory, and finally the JDK saves that Unicode-encoded compiled class to the operating system as the .class file we see. The .class file we end up with therefore stores its content in Unicode format. It contains the Chinese strings from our source file, but they have been converted from the file.encoding format to Unicode.

For JSP source files this step is different. The web container calls the JSP compiler, which first checks whether the JSP file declares a file encoding. If it does not, the JSP compiler invokes the JDK to convert the JSP file into a temporary Servlet class using the JVM's default character encoding (that is, the default file.encoding of the operating system on which the web container runs), compiles it into a class file in Unicode format, and saves it in a temporary folder. For example, on Chinese Win2K the web container converts the JSP file from GBK to Unicode and then compiles it into a temporarily stored Servlet class that serves the user's request.
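The decoding that happens before compilation can be imitated with ordinary file I/O. The sketch below is not javac's implementation; it only illustrates the idea that the same source bytes yield different text depending on the encoding assumed. The class name `SourceDecodingDemo` and the helper `decode` are hypothetical, and the GBK charset is assumed to be available.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class SourceDecodingDemo {
    // Hypothetical helper: decodes raw source-file bytes with the named charset,
    // which a compiler must do before it can parse the text.
    static String decode(byte[] sourceBytes, String charsetName) {
        return new String(sourceBytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        // A one-line "source file" containing two Chinese characters.
        String source = "String s = \"\u4e2d\u6587\";";
        Path file = Files.createTempFile("Hello", ".java");
        try {
            // Save it as GBK bytes, as an editor on Chinese Win2K would by default.
            Files.write(file, source.getBytes(Charset.forName("GBK")));
            byte[] bytes = Files.readAllBytes(file);
            // Decoding with the encoding the file was saved in restores the text.
            System.out.println(decode(bytes, "GBK").equals(source));
            // Decoding with a different encoding does not.
            System.out.println(decode(bytes, "ISO-8859-1").equals(source));
        } finally {
            Files.delete(file);
        }
    }
}
```

With the real compiler, the corresponding choice is made by the -encoding option, for example `javac -encoding GBK Hello.java`.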

Step 3: Run the classes compiled in step 2. There are four cases:

A, classes that run directly on the console
B, EJB classes, and support classes that cannot be run directly (such as the JavaBean Class)
C, JSP code and Servlet class
D, between Java programs and databases
We look at these four cases in turn.
A, classes that run directly on the console

In this case, running the class requires JVM support, so a JRE must be installed on the operating system. The process is as follows. First, java starts the JVM, which reads the class file stored in the operating system and loads its content into memory, where the class is held in Unicode format. The JVM then runs it. If the class needs to receive user input, it by default decodes the string the user enters using the file.encoding format and converts it to Unicode in memory (the user can set the encoding of the input stream). After the program runs, the resulting string (Unicode-encoded) is handed back to the JVM. Finally, the JRE converts the string to the file.encoding format (the user can set the encoding of the output stream), passes it to the operating system's display interface, and outputs it.

For a class that runs directly on the console, the conversion process is shown in Figure 1:

Figure 1

Each conversion step above must use the correct encoding; otherwise the final output will be garbled.
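Setting the encoding of the input and output streams explicitly, as mentioned for case A, might look like the following sketch. It uses in-memory streams in place of the real console so that it runs anywhere; the class name `ConsoleEncodingDemo` and both helper methods are hypothetical, and the GBK charset is assumed to be available.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;

class ConsoleEncodingDemo {
    // Hypothetical helper: reads one line, decoding the bytes with the named charset.
    static String readLine(InputStream in, String charsetName) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, charsetName));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: prints text through a PrintStream that encodes with
    // the named charset and returns the bytes that were written.
    static byte[] write(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Simulate a user typing two Chinese characters on a GBK console.
        byte[] typed = write("\u4e2d\u6587", "GBK");
        String text = readLine(new ByteArrayInputStream(typed), "GBK");
        System.out.println(text.length());               // characters held in memory
        System.out.println(write(text, "GBK").length);   // bytes sent to a GBK console
        System.out.println(write(text, "UTF-8").length); // bytes sent to a UTF-8 console
    }
}
```

For a real console, the same constructors can wrap `System.in` and `System.out`, for example `new InputStreamReader(System.in, "GBK")` and `new PrintStream(System.out, true, "GBK")`.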

B, EJB classes, and support classes that cannot be run directly (such as the JavaBean Class)

EJB classes and support classes that cannot be run directly usually do not interact with the user directly for input or output. Instead they exchange input and output with other classes. After they are compiled in step 2, they exist as class files whose content is Unicode-encoded and stored in the operating system. As long as nothing is lost when parameters are passed between them and other classes, they run correctly.
For EJB classes and support classes that cannot be run directly, the conversion process is shown in Figure 2:

Figure 2

C, JSP code and servlet class

After step 2, a JSP file has also been converted into a Servlet class file. Unlike a standard Servlet, it does not live in the classes directory but in the web container's temporary directory, so in this step we treat it as a Servlet as well.

When a client requests a Servlet, the web container invokes its JVM to run it. The JVM first reads the Servlet's class file from the system and loads it into memory, where the Servlet class code is held in Unicode. The JVM then runs the Servlet class in memory. If the running Servlet needs to accept characters from the client, such as values entered in a form or values passed in the URL, and the program has not set the encoding to use when accepting parameters, the web container by default decodes the incoming values as ISO-8859-1, and the JVM converts them into Unicode held in the web container's memory. When the Servlet runs, it generates output, and the output strings are in Unicode. The container then sends the Unicode strings generated by the Servlet (HTML markup, user output strings, and so on) to the client browser, where they are displayed to the user. If an output encoding has been specified, the strings are sent to the browser in that encoding; if not, they are sent in ISO-8859-1 by default. For JSP code and Servlet classes, the conversion process is shown in Figure 3:

Figure 3
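When a container has already decoded a parameter as ISO-8859-1, the original bytes are still recoverable, because ISO-8859-1 maps every byte value to exactly one character. The sketch below simulates that situation without a web container; it is an illustration under stated assumptions, not the solution this article goes on to propose. The class name `ParameterRecoveryDemo` and the helper `recover` are hypothetical, and the client is assumed to have sent GBK bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParameterRecoveryDemo {
    // Hypothetical helper: undoes an ISO-8859-1 decode by turning the string
    // back into its original bytes and decoding them with the charset the
    // client actually used.
    static String recover(String misdecoded, Charset actual) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: the client sent GBK bytes
        String sentByClient = "\u4e2d\u6587";

        // What a container would hand to the Servlet after decoding as ISO-8859-1.
        String seenByServlet = new String(sentByClient.getBytes(gbk), StandardCharsets.ISO_8859_1);
        System.out.println(seenByServlet.equals(sentByClient));               // garbled
        System.out.println(recover(seenByServlet, gbk).equals(sentByClient)); // repaired
    }
}
```

The repair works only if the charset passed to `recover` matches what the client really sent; guessing wrong produces a different kind of garbled text.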

D, between Java programs and databases

For almost all JDBC drivers, the default encoding used to pass data between a Java program and a database is ISO-8859-1. When our program stores data containing Chinese, JDBC first converts the Unicode data in the program to ISO-8859-1 and then passes it to the database, and the database saves it as ISO-8859-1 by default. This is why Chinese data read back from the database is so often garbled.
The data transfer between Java programs and databases is shown in Figure 4.

Figure 4
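Why a Unicode to ISO-8859-1 conversion damages Chinese text can be shown without a database. The sketch below only simulates the conversion described above with standard String methods; it does not use JDBC, and the class name `LossyEncodingDemo` and the helper `throughLatin1` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {
    // Hypothetical helper: encodes Unicode text as ISO-8859-1 bytes (standing in
    // for what gets stored) and decodes those bytes again (standing in for what
    // is read back).
    static String throughLatin1(String text) {
        byte[] stored = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // ASCII text fits in ISO-8859-1 and survives the trip.
        System.out.println(throughLatin1("abc"));
        // Chinese characters have no ISO-8859-1 representation; String.getBytes
        // substitutes a question mark byte for each, so the original is lost.
        System.out.println(throughLatin1("\u4e2d\u6587"));
    }
}
```

Unlike the decoding mistakes shown for other cases, this loss happens at encoding time, so the stored bytes no longer contain enough information to restore the characters.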

    3. Principles that must be understood when analyzing common Java Chinese problems

    First, the detailed analysis above makes it clear that the key encoding conversions in the life of any Java program are the conversion performed when the source is first compiled into a class file and the conversion performed when output is finally delivered to the user.
    Second, we must know the main encodings that Java supports at compile time:
    * ISO-8859-1: 8-bit; aliases include 8859_1, ISO-8859-1, and ISO_8859_1
    * Cp1252: American English encoding, the ANSI standard encoding
    * UTF-8: corresponds to Unicode encoding
    * GB2312: aliases include gb2312-80 and gb2312-1980
    * GBK: same as MS936; an extension of GB2312
    There are also other encodings, for Korean, Japanese, traditional Chinese, and so on. We should also note how these encodings relate to one another:
    Unicode and UTF-8 correspond one to one. GB2312 can be regarded as a subset of GBK; that is, GBK is an extension of GB2312. GBK contains 20,902 Chinese characters in the code range 0x8140-0xFEFE, and every one of them has a corresponding character in Unicode 2.0.
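These relationships can be probed at run time with `java.nio.charset.Charset`. The following is a small sketch under stated assumptions: the class name `CharsetCoverageDemo` and the helper `canEncode` are hypothetical, the sample characters were chosen for illustration, and the GB2312 and GBK charsets are assumed to be present in the runtime.

```java
import java.nio.charset.Charset;

class CharsetCoverageDemo {
    // Hypothetical helper: reports whether the named charset can represent the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char simplified = '\u4e2d';  // a common simplified character, present in GB2312
        char traditional = '\u9f8d'; // a traditional character, absent from GB2312

        // GBK extends GB2312: it covers everything GB2312 does, and more.
        System.out.println(canEncode("GB2312", simplified));  // true
        System.out.println(canEncode("GBK", simplified));     // true
        System.out.println(canEncode("GB2312", traditional)); // false
        System.out.println(canEncode("GBK", traditional));    // true

        // ISO-8859-1 cannot represent Chinese at all, while UTF-8 can.
        System.out.println(canEncode("ISO-8859-1", simplified)); // false
        System.out.println(canEncode("UTF-8", simplified));      // true

        // An alias resolves to the same charset as its canonical name.
        System.out.println(Charset.forName("8859_1").name()); // ISO-8859-1
    }
}
```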

Third, for a .java source file stored in the operating system, we can specify the encoding of its content at compile time with the -encoding option. Note: if the source file contains Chinese characters and you use -encoding to name an encoding other than the one the file was saved in, the result will clearly be wrong. If you use -encoding to declare the source file as GBK or GB2312 (and the file was in fact saved in that encoding), then no matter which system we compile it on, the Chinese characters in the source are correctly converted to Unicode and stored in the class file.

Finally, we must be clear that almost all web containers use ISO-8859-1 as their internal default character encoding, while almost all browsers pass parameters in UTF-8 by default. So even if our Java source file is given the correct encoding at the points of input and output, the data is still handled as ISO-8859-1 while it is inside the container.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
