In-depth analysis of Chinese problems in Java programming and the best solution

Source: Internet
Author: User
1. Source of Chinese problems
Early operating systems supported only single-byte character encodings, so software was originally written to process English text in single-byte form. As computers spread, Unicode was introduced to accommodate the world's other languages (including Chinese). Unicode was originally designed around double-byte codes and is compatible both with English characters and with the double-byte characters of other languages. Most internationalized software therefore uses Unicode internally: at run time it looks up the default encoding of the local platform (usually the operating system) and converts its internal Unicode text to that encoding. Java's JDK and JVM work the same way. "JDK" here means the international version of the JDK, which most programmers use; every JDK mentioned below refers to that version.
Chinese is a double-byte language. To let computers process Chinese characters, standards such as GB2312, GBK, and GB18030 (also known as GBK2K) were developed. Most operating systems therefore ship a customized Chinese version that uses GBK or GB2312 to display Chinese characters correctly. For example, Chinese Win2k displays text in GBK by default and also saves files in GBK by default, so files stored on Win2k are GBK-encoded unless another encoding is chosen. Note: GBK is an extension of GB2312.
Because Java uses Unicode internally, a running Java program must convert between Unicode and the encodings supported by the operating system and the browser. This conversion involves a series of steps, and if any one of them goes wrong, the Chinese characters are displayed as garbage. This is the common Java Chinese problem.
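The conversion described above can be observed directly. The following is a minimal sketch, not part of the original article: the class name MojibakeDemo and its helper methods are made up for illustration, and it assumes the running JDK ships the GBK charset. It encodes the same two Chinese characters in GBK and in UTF-8, then shows that decoding GBK bytes with the wrong charset does not give back the original string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // The two characters U+4E2D U+6587 ("Chinese"), written with Unicode
    // escapes so that this source file itself stays pure ASCII.
    static final String TEXT = "\u4e2d\u6587";

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        System.out.println("GBK bytes:   " + hex(TEXT.getBytes(gbk)));                     // D6 D0 CE C4
        System.out.println("UTF-8 bytes: " + hex(TEXT.getBytes(StandardCharsets.UTF_8)));  // E4 B8 AD E6 96 87
        // Same charset on both sides: the text survives.
        System.out.println("GBK -> GBK intact:        " + TEXT.equals(transcode(TEXT, gbk, gbk)));
        // Wrong charset on the decoding side: four unrelated Latin-1 characters come out.
        System.out.println("GBK -> ISO-8859-1 intact: "
                + TEXT.equals(transcode(TEXT, gbk, StandardCharsets.ISO_8859_1)));
    }
}
```

The bytes themselves are never damaged in this sketch; only the interpretation differs, which is why a single wrong step in the chain is enough to produce garbled output.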
Java is also a cross-platform language. The programs we write must run not only on Chinese Windows but also on Chinese Linux and other systems, and on English systems as well (we often see Java programs written on Chinese Win2k ported to English Linux). Porting of this kind can also cause Chinese problems.
In addition, some people run programs containing Chinese characters, or browse Chinese web pages, on English operating systems and in browsers such as Internet Explorer. These environments may not support Chinese characters, which can also cause Chinese problems.
Finally, almost all browsers pass parameters in UTF-8 by default rather than in a Chinese encoding, so passing Chinese parameters can also produce garbled text.
In short, these are the main sources of Chinese problems in Java. We call any failure of a program to run correctly for the reasons above a Java Chinese problem.
2. Detailed process of Java encoding and conversion
Common Java programs include:
* Classes that run directly on the console (including visual interface classes)
* JSP code classes (Note: a JSP is a variant of a servlet class)
* Servlet classes
* EJB class
* Other support classes that cannot be directly run
These class files may contain Chinese strings, and the first three types of Java programs often interact directly with users to read and write characters. For example, a JSP or servlet receives characters sent from the client, and these may include Chinese characters. Whatever role these Java classes play, their life cycle is as follows:
* The programmer chooses an editor on some operating system, writes the source code, and saves it in the operating system with the .java extension. For example, you can use Notepad to edit a Java source program on Win2k;
* The programmer uses javac.exe from the JDK to compile the source code into .class files (JSP files are compiled by the container, which calls the JDK);
* The classes are run directly, or deployed to a web container to run, and the results are output.
How do the JDK and JVM encode, decode, and run these files during this process?
Here, we use the Chinese Win2k operating system as an example to illustrate how Java classes are encoded and decoded.
Step 1: Write the Java source file (any of the five types of Java program above) in an editor such as Notepad on Win2k. By default the file is saved in GBK, the operating system's default encoding, which Java exposes as the file.encoding system property. So before compilation, the .java source file, which contains both Chinese characters and English program code, is stored in the file.encoding format. To view the system's file.encoding value, you can use the following code:
public class ShowSystemDefaultEncoding {
    public static void main(String[] args) {
        // file.encoding holds the default character encoding picked up by the JVM
        String encoding = System.getProperty("file.encoding");
        System.out.println(encoding);
    }
}
Step 2: Compile the Java source file with the JDK's javac.exe. If we do not specify the encoding of the source file, the JDK first reads the operating system's file.encoding parameter (the platform default encoding; on Win2k its value is GBK). The JDK then converts the source program from the file.encoding format to Java's internal Unicode format in memory. Next, javac compiles the converted Unicode text into a .class file, which is held temporarily in memory in Unicode form and then written to the operating system as the .class file we see. The resulting .class file therefore stores its content in Unicode: the Chinese strings from our source program are still there, but they have been converted from the file.encoding format to Unicode.
For JSP source files this step is different. The web container calls the JSP compiler, which first checks whether the JSP file declares a file encoding. If it does not, the JSP compiler calls the JDK to convert the JSP file into a temporary servlet class using the JVM's default character encoding (that is, the default file.encoding of the operating system on which the web container runs). The servlet is then compiled into a Unicode-format class and saved in a temporary folder. For example, on Chinese Win2k the web container converts the JSP file from GBK to Unicode and compiles it into a temporarily saved servlet class that responds to user requests.
Step 3: run the classes compiled in step 2:
A. classes run directly on the console
B. EJB class and support class that cannot be directly run (such as JavaBean class)
C. JSP code and Servlet class
D. Between Java programs and databases
Let's look at these four situations.
A. classes run directly on the console
In this case a JVM is needed to run the class, so a JRE must be installed on the operating system. The process is as follows. First, the java command starts the JVM, which reads the class file stored in the operating system into memory. The class is now in Unicode form in memory, and the JVM runs it. If the class needs to receive user input, by default it decodes the entered string using the file.encoding format, converts it to Unicode, and keeps it in memory (you can set the encoding of the input stream). After the program runs, the resulting Unicode string is handed back to the JVM, and the JRE converts it to the file.encoding format (you can set the encoding of the output stream) before passing it to the operating system's display interface for output.
Each conversion step above must use the correct encoding to avoid garbled characters.
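As a sketch of "setting the encoding of the input stream and the output stream", the example below decodes one line of input with an explicit charset and writes it back out with another. The class name ConsoleEncodingDemo and the method echoLine are hypothetical, and in-memory streams stand in for a GBK console so the example runs anywhere; it assumes the JDK provides the GBK charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {
    /**
     * Reads one line from in, decoding its bytes with inputCharset, then writes
     * the line to out encoded with outputCharset. Returns the decoded line.
     */
    static String echoLine(InputStream in, Charset inputCharset,
                           OutputStream out, Charset outputCharset) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, inputCharset));
            String line = reader.readLine();   // from here on the text is Unicode in memory
            if (line == null) {
                line = "";
            }
            Writer writer = new OutputStreamWriter(out, outputCharset);
            writer.write(line);                // converted back to bytes on the way out
            writer.flush();
            return line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        // Simulate a user typing U+4E2D U+6587 on a console that delivers GBK bytes.
        byte[] typed = "\u4e2d\u6587\n".getBytes(gbk);
        ByteArrayOutputStream shown = new ByteArrayOutputStream();
        String line = echoLine(new ByteArrayInputStream(typed), gbk, shown, StandardCharsets.UTF_8);
        System.out.println("decoded correctly: " + "\u4e2d\u6587".equals(line));
        System.out.println("bytes written as UTF-8: " + shown.size()); // 6
    }
}
```

With a real console, the same method could be called with System.in and System.out and with Charset.defaultCharset() on both sides, which reproduces the default file.encoding behavior described above.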
B. EJB class and support class that cannot be directly run (such as JavaBean class)
EJB classes and support classes that cannot be run directly generally do not interact with users for input and output; they usually exchange data with other classes. After step 2 they are saved in the operating system as classes with Unicode content, and as long as nothing is lost when parameters are passed between them and other classes, they run correctly.
C. JSP code and Servlet class
After step 2, a JSP file has also been converted to a servlet class. Unlike a standard servlet it does not live in the classes directory but in the web container's temporary directory, so in this step we treat it as a servlet too.
When a client requests a servlet, the web container calls its JVM to run it. The JVM first reads the servlet class from the system and loads it into memory, where the servlet class code is in Unicode form, and then runs it. If the running servlet needs to accept characters sent from the client, such as values entered in a form or passed in the URL, and no encoding is set in the program, the web container by default decodes the incoming values as ISO-8859-1 and converts them to Unicode in the JVM's memory. After the servlet runs, its output strings are in Unicode. The container then sends the strings generated by the servlet (such as HTML markup and user output) to the client browser. If an output encoding has been specified, the strings are sent in that encoding; otherwise they are sent to the browser as ISO-8859-1 by default.
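The effect of the container's ISO-8859-1 default can be reproduced without a servlet container. The sketch below is an illustration only: the class ParameterRecodeDemo and its methods are made-up names, and the container is simulated by a plain ISO-8859-1 decode. It shows how a UTF-8 parameter is garbled, and how the original text can be recovered by turning the garbled string back into bytes and decoding them with the charset the browser actually used.

```java
import java.nio.charset.StandardCharsets;

public class ParameterRecodeDemo {
    /** Simulates a container that decodes incoming request bytes as ISO-8859-1. */
    static String decodeAsContainerDefault(byte[] requestBytes) {
        return new String(requestBytes, StandardCharsets.ISO_8859_1);
    }

    /**
     * Undoes the wrong decode. ISO-8859-1 maps every byte to exactly one
     * character, so re-encoding with it restores the original bytes, which
     * are then decoded as UTF-8.
     */
    static String recoverUtf8(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";                                 // U+4E2D U+6587
        byte[] sentByBrowser = original.getBytes(StandardCharsets.UTF_8); // browser sends UTF-8
        String garbled = decodeAsContainerDefault(sentByBrowser);
        System.out.println("garbled length: " + garbled.length());        // 6, one char per byte
        System.out.println("recovered: " + original.equals(recoverUtf8(garbled)));
    }
}
```

In a real servlet, the usual alternative is to tell the container the encoding before reading any parameter, for example with request.setCharacterEncoding("UTF-8"), and to declare the output encoding with response.setContentType("text/html; charset=UTF-8"). Whether that covers URL parameters depends on the container's configuration.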
D. Between Java programs and databases
For almost all JDBC drivers, data is transferred between the Java program and the database in ISO-8859-1 by default. When our program stores data containing Chinese characters, JDBC first converts the program's internal Unicode data to ISO-8859-1 and then passes it to the database, which also saves it as ISO-8859-1 by default. This is why the Chinese data we read back from the database is so often garbled.
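Why a conversion to ISO-8859-1 damages Chinese data can be shown without a database. The following sketch makes no JDBC calls: the class Latin1LossDemo and its methods are hypothetical names, and the "channel" is just an encode and decode with ISO-8859-1. It checks whether a string can be represented in ISO-8859-1 and shows what survives when it cannot.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1LossDemo {
    /** Returns true if every character of text can be represented in charset. */
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    /** Sends text through an ISO-8859-1-only channel and reads it back. */
    static String throughLatin1(String text) {
        // String.getBytes replaces each unmappable character with '?'.
        byte[] wire = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // U+4E2D U+6587
        System.out.println("encodable in ISO-8859-1: "
                + canEncode(chinese, StandardCharsets.ISO_8859_1)); // false
        System.out.println("after round trip: " + throughLatin1(chinese)); // ??
        System.out.println("ASCII survives: " + throughLatin1("abc"));     // abc
    }
}
```

Unlike a wrong decode, this loss cannot be undone afterwards, because the original bytes were never stored.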

3. Several principles that must be clarified when analyzing common Java Chinese problems
First, the detailed analysis above shows that the key encoding conversions in the life cycle of any Java program are two: the conversion when the source is first compiled into a class file, and the conversion when the result is finally output to the user.
Second, we must understand the common encodings that Java supports at compile time:
* ISO-8859-1, 8-bit, also known as 8859_1, ISO_8859_1, and other aliases
* Cp1252, the Windows Western European code page, often called the ANSI code page
* UTF-8, an encoding of Unicode
* GB2312, same as gb2312-80, gb2312-1980, etc.
* GBK, same as MS936, an extension of GB2312
There are other encodings as well, such as those for Korean, Japanese, and traditional Chinese. We should also note how these encodings relate to one another:
Unicode and UTF-8 correspond one to one: every Unicode character has exactly one UTF-8 encoding. GB2312 can be regarded as a subset of GBK, that is, GBK extends GB2312. GBK contains 20902 Chinese characters in the range 0x8140-0xFEFE, and each of them corresponds one to one with a character in Unicode 2.0.
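These relationships can be checked with the java.nio.charset API. The sketch below uses a made-up class name, GbSubsetDemo, and assumes the JDK provides the GB2312 and GBK charsets. It compares the bytes produced by GB2312 and GBK for a simplified Chinese string, tests a traditional character (U+9AD4) that GBK covers but GB2312 does not, and confirms that UTF-8 round-trips the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GbSubsetDemo {
    // Assumption: both charsets are available in the running JDK.
    static final Charset GB2312 = Charset.forName("GB2312");
    static final Charset GBK = Charset.forName("GBK");

    /** Returns true if both charsets encode text to identical bytes. */
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    /** Returns true if charset has a representation for every character of text. */
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    /** Returns true if encoding to UTF-8 and decoding again gives back text. */
    static boolean utf8RoundTrips(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return text.equals(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String simplified = "\u4e2d\u6587"; // U+4E2D U+6587, present in GB2312
        String traditional = "\u9ad4";      // U+9AD4, a traditional form
        System.out.println("GB2312 and GBK agree: " + sameBytes(simplified, GB2312, GBK));
        System.out.println("GBK encodes U+9AD4:    " + canEncode(traditional, GBK));
        System.out.println("GB2312 encodes U+9AD4: " + canEncode(traditional, GB2312));
        System.out.println("UTF-8 round trip:      " + utf8RoundTrips(simplified + traditional));
    }
}
```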
Third, for a .java source file stored in the operating system, we can specify the encoding of its content at compile time with the -encoding option. Note: if the source program contains Chinese characters and -encoding names some other, unrelated encoding, the result is obviously wrong. If we use -encoding to specify GBK or GB2312, matching how the file was saved, then no matter what system we compile on, a Java source program containing Chinese characters is correctly converted to Unicode and stored in the class file.
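The effect of -encoding can be tried from Java itself through the standard javax.tools compiler API. This is a rough sketch under several assumptions: it must run on a full JDK (not a bare JRE) that supports GBK, and the class JavacEncodingDemo, its method compileAndRead, and the generated class Greeting are all made up for the illustration. It saves a one-line source file in GBK, compiles it once with the matching -encoding and once with a mismatched one, and reads back the string constant that ended up in the class file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.reflect.Field;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class JavacEncodingDemo {
    // Hypothetical source file whose string literal holds U+4E2D U+6587.
    static final String SOURCE =
            "public class Greeting { public static final String TEXT = \"\u4e2d\u6587\"; }";

    /**
     * Saves SOURCE in fileCharset, compiles it with "-encoding javacEncoding",
     * loads the resulting class, and returns the constant the compiler stored.
     */
    static String compileAndRead(String fileCharset, String javacEncoding) {
        try {
            Path dir = Files.createTempDirectory("encoding-demo");
            Path src = dir.resolve("Greeting.java");
            Files.write(src, SOURCE.getBytes(Charset.forName(fileCharset)));

            JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
            if (javac == null) {
                throw new IllegalStateException("no system compiler: a JDK is required");
            }
            int status = javac.run(null, null, null,
                    "-encoding", javacEncoding, "-d", dir.toString(), src.toString());
            if (status != 0) {
                throw new IllegalStateException("compilation failed");
            }
            try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                Field field = loader.loadClass("Greeting").getField("TEXT");
                return (String) field.get(null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "\u4e2d\u6587";
        System.out.println("matching -encoding keeps the text:   "
                + expected.equals(compileAndRead("GBK", "GBK")));
        System.out.println("mismatched -encoding keeps the text: "
                + expected.equals(compileAndRead("GBK", "ISO-8859-1")));
    }
}
```

On the command line the equivalent of the first call is javac -encoding GBK Greeting.java. The second call compiles without any error, which is what makes this kind of mistake easy to miss: the damage only shows when the string is displayed.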

Finally, we must be clear that almost all web containers use ISO-8859-1 as their internal default character encoding, while almost all browsers pass parameters in UTF-8 by default. So even though we specify the correct encoding for our Java source file at compile time, the data is still handled as ISO-8859-1 when the program runs inside the container.
