Several principles of analysis on Java Chinese problems


Introduction

Although Java's handling of Chinese text has been discussed many times, there is no official standard for it. Java technology covers a wide range of content (more than 10 related technologies) and involves many vendors of Java-oriented Web servers, application servers, and JDBC database drivers. As a result, the Chinese processing problems of a Java application vary with the choice of server and driver, which makes them harder to analyze. So how can we find the crux of the problem among so many symptoms?

General solutions to Java Chinese problems

In fact, Java's Chinese problems arise because the default encoding used by a Java application differs from the encoding of the target, or from the encoding in which the application reads characters (see document 1). There are usually four ways to solve them:

1 Select the Chinese localized version of the JDK. The Chinese localized version of the Java 2 JDK (http://java.sun.com/products/jdk/1.2/chinesejdk.html) is not an official release, and Sun does not promise to upgrade it, but it is still one solution to the Java Chinese problem.

2 Select the appropriate compilation parameters. With the international version of Java, we can make the compiled result support Chinese by specifying the source encoding at compile time. For example, javac -encoding Big5 SourceFile.java and javac -encoding gb2312 SourceFile.java compile the source for Traditional Chinese and Simplified Chinese applications, respectively.

3 Convert the character encoding programmatically. Solving Java's Chinese problems in code has become common practice. The following is one of the most common conversion functions. It takes a string that was decoded as ISO-8859-1 and decodes its bytes again as GBK, the encoding used by Chinese Windows systems.

    public static String ToChinese(String strvalue) {
        try {
            if (strvalue == null) {
                return null;
            } else {
                // Recover the raw bytes, then decode them as GBK.
                strvalue = new String(strvalue.getBytes("ISO8859_1"), "GBK");
                return strvalue;
            }
        } catch (Exception e) {
            // An unsupported encoding name ends up here.
            return null;
        }
    }

4 Define the output character set. For a JSP application, we can use <%@ page contentType="text/html; charset=GBK" %> or <%@ page contentType="text/html; charset=gb2312" %> to define the output character set of the JSP page. We can also use the HTML tag <meta http-equiv="Content-Type" content="text/html; charset=gb2312"> to define it.
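The conversion in method 3 can be tried in isolation. The following is a minimal sketch, not a definitive implementation: it simulates a component that wrongly decodes GBK bytes as ISO-8859-1 and then repairs the result the same way ToChinese does. The class and method names are assumptions made for illustration, and the sketch assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepairDemo {
    // Assumption: the GBK charset is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Simulates a component that decodes GBK bytes as ISO-8859-1 by mistake.
    static String misdecode(String original) {
        byte[] gbkBytes = original.getBytes(GBK);
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);
    }

    // Same idea as ToChinese: recover the raw bytes, then decode them as GBK.
    static String repair(String garbled) {
        if (garbled == null) {
            return null;
        }
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "\u4e2d\u6587";
        String garbled = misdecode(original);
        System.out.println(garbled.length());                  // 4: one char per GBK byte
        System.out.println(original.equals(repair(garbled)));  // true
    }
}
```

The repair works only because ISO-8859-1 maps every byte to exactly one character, so no information is lost in the wrong decoding step.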

Existing problems

Depending on how they are implemented, the four methods above fall into two categories. Methods 1, 2, and 4 work by applying a standard or rule. Method 3 works through specific programming.

Because methods 1, 2, and 4 are rule-based, they are relatively simple to apply and are general rather than targeted. For example, method 2 sets the internal encoding when the Java source files are compiled, regardless of which part of the source code actually has a Chinese processing problem, such as garbled output.

However, because these methods are not targeted and treat every problem the same way, in some cases they do not completely solve the Java Chinese problem. Take a very common example. A user's Java application often needs to interact with other Java interfaces, such as accessing a database through some version of JDBC. The encodings that a JDBC driver supports vary with the provider and even with the version. If Chinese is not handled correctly during database input and output, we need two exactly opposite encoding conversions, one on input and one on output, which methods 1, 2, and 4 usually cannot provide. For method 2 there are some tricks for this situation. The most effective one is to split the Java application into components. For example, we can put the database input code and output code into different source files and compile each with the encoding it needs. But ordinary programming is unlikely to meet this requirement, because the resulting design is likely to be unreasonable. It is good design to encapsulate the read and write methods of a database in one class, and it would be very unreasonable to implement the two methods of that class in two separate files. So although methods 1, 2, and 4 are relatively simple to implement, they have some insurmountable shortcomings. This is also why the relatively complex programming approach is popular.

Compared with methods 1, 2, and 4, method 3 is more targeted and flexible. The program can deal with different situations and convert the character encoding wherever it is needed. This places higher demands on the developer, who must be able to identify accurately where a Chinese problem is likely to occur and then make the right judgment and apply the right treatment.
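The two opposite conversions described above might look like the following minimal sketch. It assumes a hypothetical driver that treats every string as ISO-8859-1 while the application data is GBK. The class and method names are illustrative and are not part of JDBC, and the right charset pair for a real driver has to be determined from its documentation or by testing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DriverEncodingBridge {
    // Assumption: the application data is GBK and the driver assumes ISO-8859-1.
    static final Charset APP_CHARSET = Charset.forName("GBK");
    static final Charset DRIVER_CHARSET = StandardCharsets.ISO_8859_1;

    // Apply before writing: present the GBK bytes as ISO-8859-1 characters.
    static String toDriver(String value) {
        return value == null ? null : new String(value.getBytes(APP_CHARSET), DRIVER_CHARSET);
    }

    // Apply after reading: the exact opposite conversion.
    static String fromDriver(String value) {
        return value == null ? null : new String(value.getBytes(DRIVER_CHARSET), APP_CHARSET);
    }

    public static void main(String[] args) {
        String original = "\u6570\u636e\u5e93"; // three Chinese characters
        String wire = toDriver(original);
        System.out.println(wire.length());                       // 6: two GBK bytes per character
        System.out.println(original.equals(fromDriver(wire)));   // true
    }
}
```

Keeping both directions in one class preserves the design the text argues for: the read and write paths stay together, and only the conversion differs.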

Principles of the analysis

All in all, the solutions to Java Chinese processing are not very complicated. What is complicated is discovering the problems correctly and in time, because Java technology, especially J2EE, spans a wide range, and the Web servers, application servers, and JDBC database drivers involved vary in quality. So how do we find these problems?

Typically, Java's Chinese processing problems arise because the default encoding used by the user's Java application differs from the encoding of the target, or from the encoding in which the application reads characters. One of the main causes of this difference is data exchange (direct or indirect input and output) between the user's Java application and other applications whose encodings do not match. To detect problems in time, we can start from this point and analyze the application according to the following principles:

1 Watch how character variables change. The encoding of a character variable is not visible in the code, so assigning new values and combining several variables may change the character set of the result. Operations that mix variables with data submitted from a page are especially likely to combine characters in different encodings.
2 Watch every form of character input and output. "Every form" matters because most Java applications are developed as Web applications, so compared with applications in other languages they face a wide variety of character data exchange on the Web. Examples include the various forms of data submission, data read from URLs, encrypted character data exchange, selection results from Web controls, and content displayed in controls (such as list controls).
3 Be careful with third-party components and applications. Because their implementation is not transparent, it is generally difficult to tell what default encoding these components or drivers use, or to control it. When exchanging data through the interfaces they provide, pay special attention: if Chinese is not handled correctly, first examine your own code and adjust it to fit these interfaces, because such components or applications usually do not provide a way to change their encoding. If necessary, you may need to switch to other replaceable components or applications.
4 Watch the data input and output inside the objects you call. This situation is well hidden. When our application interacts with objects (such as serialized objects), the object may process character data, perform some input or output, or even throw an exception whose message contains Chinese, and in each case Chinese may not be displayed correctly. Because this behavior is encapsulated in the object, it is easy to overlook when writing programs. It is also somewhat unpredictable: for example, we may not know when an object will throw which exception, so some testing is needed.
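As one instance of principle 2, data read from a URL is only decoded correctly when the charset used for decoding matches the one the sender used for encoding. The following is a small sketch using the standard java.net.URLDecoder class. The sample value and the class name UrlParameterDemo are assumptions made for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlParameterDemo {
    // Decodes a percent-encoded value with an explicitly named charset.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, percent-encoded from their GBK bytes.
        String raw = "%D6%D0%CE%C4";
        String expected = "\u4e2d\u6587";
        System.out.println(expected.equals(decode(raw, "GBK")));   // true
        System.out.println(expected.equals(decode(raw, "UTF-8"))); // false: wrong charset
    }
}
```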

5 Watch the data access process for the database. Java connects to the database through JDBC. Most JDBC drivers were not designed for Chinese systems (they mostly handle Chinese data as ISO-8859-1), so character encoding conversion is often required when reading and writing data. Even so, we recommend reading the driver's documentation carefully. If you really cannot determine which encoding a JDBC driver uses for character data, our advice is to do the necessary testing. For example, the following code (taking character operations as an example) correctly reads Chinese characters from MS SQL Server 2000 on the Simplified Chinese Windows 2000 platform, using the JDBC driver provided with WebLogic 6.0:

    ...
    Class.forName("weblogic.jdbc.mssqlserver4.Driver").newInstance();
    conn = myDriver.connect("jdbc:weblogic:mssqlserver4", props);
    conn.setCatalog("Labmanager");
    Statement st = conn.createStatement();
    // execute a query
    String testStr;
    String testTempStr = new String();
    // Encoding conversion: take the ISO-8859-1 bytes and decode them
    // with the platform default charset (GBK on Simplified Chinese Windows).
    testStr = new String(testTempStr.getBytes("ISO-8859-1"));
    DatabaseMetaData dbMetaData = conn.getMetaData();
    ResultSet rs = dbMetaData.getTables(null, null, null, new String[]{"TABLE"});
    while (rs.next()) {
        for (int j = 1; j <= rs.getMetaData().getColumnCount(); j++) {
            testStr = testStr + new String(rs.getObject(j).toString().getBytes("ISO-8859-1"));
        }
    }

However, note that different JDBC drivers support the same database differently, and the same JDBC driver supports different databases differently. Our character conversion code may therefore stop working correctly when the JDBC driver, or even just its version, changes. For example, the code above no longer handles Chinese correctly in the same environment after switching to the i-net UNA driver, version 2.03, for MS SQL Server. The reason is simple: that driver itself supports GBK encoding, so no conversion is needed.
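One way to limit the damage of a driver change is to keep the conversion in a single place and make it configurable per driver. The following is a hedged sketch under that assumption. The class DriverCharsetPolicy and its constructor parameters are hypothetical and belong to no JDBC driver. It also shows why a conversion that is no longer needed corrupts the data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DriverCharsetPolicy {
    private final Charset driverCharset; // null: the driver already returns correct strings
    private final Charset dataCharset;   // the encoding the stored bytes are really in

    DriverCharsetPolicy(Charset driverCharset, Charset dataCharset) {
        this.driverCharset = driverCharset;
        this.dataCharset = dataCharset;
    }

    // Converts a value read from the driver, or passes it through unchanged.
    String read(String fromDriver) {
        if (fromDriver == null || driverCharset == null) {
            return fromDriver;
        }
        return new String(fromDriver.getBytes(driverCharset), dataCharset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String text = "\u4e2d\u6587";
        String garbled = new String(text.getBytes(gbk), StandardCharsets.ISO_8859_1);

        DriverCharsetPolicy legacy = new DriverCharsetPolicy(StandardCharsets.ISO_8859_1, gbk);
        DriverCharsetPolicy encodingAware = new DriverCharsetPolicy(null, gbk);

        System.out.println(text.equals(legacy.read(garbled)));     // true: conversion needed
        System.out.println(text.equals(encodingAware.read(text))); // true: no conversion
        // Converting text that is already correct destroys it: the Chinese
        // characters have no ISO-8859-1 bytes and become question marks.
        System.out.println(legacy.read(text));                     // ??
    }
}
```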

6 Do the necessary testing. Java Chinese problems change with the Web server, browser, operating environment, and development tools, so to avoid them we must do some targeted testing. Testing becomes especially important when analysis alone cannot tell us whether a Chinese processing problem is likely to occur, or which link (Web server, browser, JDBC driver, and so on) is causing it. We may also need broader tests that exercise the Web server, browser, and JDBC driver together, so that we can identify problems hidden in the interaction between several links.
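Such a targeted test can be as simple as storing a known Chinese sample, reading it back, and checking which conversion recovers it. The following is a minimal sketch of that idea. The class EncodingProbe, its method, and the candidate charset list are assumptions made for illustration, and in a real test the observed string would come from the server or driver under test rather than being simulated.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class EncodingProbe {
    // Tries every ordered pair of candidate charsets and reports the pairs
    // that turn the observed string back into the expected one.
    static List<String> findWorkingConversions(String observed, String expected, String[] candidates) {
        List<String> hits = new ArrayList<>();
        for (String from : candidates) {
            for (String to : candidates) {
                if (from.equals(to)) {
                    continue;
                }
                byte[] raw = observed.getBytes(Charset.forName(from));
                String converted = new String(raw, Charset.forName(to));
                if (expected.equals(converted)) {
                    hits.add(from + " -> " + to);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String expected = "\u4e2d\u6587";
        // Simulated result of a component that decoded GBK bytes as ISO-8859-1.
        String observed = new String(expected.getBytes(Charset.forName("GBK")),
                StandardCharsets.ISO_8859_1);
        String[] candidates = {"ISO-8859-1", "GBK", "UTF-8"};
        // Prints the conversions that recovered the text.
        System.out.println(findWorkingConversions(observed, expected, candidates));
    }
}
```

If the probe returns an empty list, either no conversion is needed or the data was damaged irreversibly before it reached the application.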

Conclusion

In fact, the root cause of Java Chinese processing problems is that the encoding of the Chinese characters (variables) being manipulated differs from the encoding of the target. All of these problems occur while characters are being read or output. As long as we keep a firm grasp on these steps, we can better discover, analyze, handle, and prevent Chinese problems in Java.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
