Garbled Chinese characters in Java

Last Update:2017-06-12 Source: Internet

Author: User

Tags locale websphere application server


1. The origin of the problem

Each country (or region) defines its own character encoding set for computer information interchange, such as the extended ASCII code of the United States, China's GB2312-80, and Japan's JIS. These sets are the basis of information processing in that country or region and play an important role in unifying encoding. By length, character encoding sets are divided into SBCS (single-byte character sets) and DBCS (double-byte character sets). Early software, especially operating systems, appeared in various localized versions (L10N) to handle local character data, and concepts such as LANG and codepage were introduced to tell them apart. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining each localized version independently is costly. It therefore became necessary to extract what the localization work has in common and handle it consistently, keeping locale-specific processing to a minimum. This is internationalization (I18N). The language-specific information is further standardized as locale information, and the underlying character set becomes Unicode, which contains almost all glyphs.

Most software with internationalization features now bases its core character processing on Unicode. At run time it determines the local character encoding from the locale/LANG/codepage settings and handles local characters accordingly. During processing it must convert between Unicode and the local character set, or even between two different local character sets with Unicode as the intermediate form. The same approach extends to networked environments: character data at either end of the network must be converted into acceptable content according to the character set settings.

Internally, the Java language represents characters in Unicode and follows Unicode V2.0. A Java program performs a character encoding conversion whenever it reads or writes a file as a character stream, writes HTML to a URL connection, or reads a parameter value from a URL connection. This adds to programming complexity and easily causes confusion, but it is in line with the idea of internationalization.

In theory, these conversions driven by character set settings should not cause many problems. In practice, because of the actual operating environment of applications, the ongoing additions and revisions to Unicode and the individual local character sets, and non-conforming system or application implementations, transcoding problems frequently trouble programmers and users.

2. GB2312-80, GBK, GB18030-2000 character sets and encodings

In fact, the fix for a Chinese character encoding problem in a Java program is often very simple, but understanding the reason behind it and locating the problem requires knowledge of the existing Chinese character encodings and the conversions between them.

GB2312-80 was developed in the early stage of Chinese computer character information technology. It contains most of the commonly used level-1 and level-2 Chinese characters, plus 9 zones of symbols. It is supported by almost all Chinese systems and internationalized software and is the most basic Chinese character set. Its encoding range is 0xa1-0xfe for the high byte and also 0xa1-0xfe for the low byte; Chinese characters start at 0xb0a1 and end at 0xf7fe.

GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20,902 Chinese characters, and its encoding range is 0x8140-0xfefe, excluding the code positions whose high byte is 0x80. All of its characters map one-to-one to Unicode 2.0, which means that Java in fact supports the GBK character set. GBK is the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it, apparently because its authors do not fully understand GBK. Note that GBK is not a national standard but only a specification. With the release of the GB18030-2000 national standard, it will complete its historical mission in the near future.
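
The relationship between the two code ranges can be checked from Java. The following is a minimal sketch, assuming the running JDK ships the "GB2312" and "GBK" charsets (most full JDKs do); the class and method names are made up for illustration. It uses \u554a, the character at 0xb0a1, and \u4e02, a character that only GBK covers.

```java
import java.nio.charset.Charset;

class CharsetCoverage {
    /** Returns true if the named charset has a code for the character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    /** Encodes one character and returns its bytes as lowercase hex. */
    static String hex(String charsetName, char c) {
        byte[] bytes = String.valueOf(c).getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char first = '\u554a';   // the character at the first Chinese code position of GB2312
        char gbkOnly = '\u4e02'; // a character GBK adds beyond GB2312
        System.out.println("GB2312 code: " + hex("GB2312", first));
        System.out.println("GBK code:    " + hex("GBK", first));
        System.out.println("GB2312 covers the GBK-only character: " + canEncode("GB2312", gbkOnly));
        System.out.println("GBK code:    " + hex("GBK", gbkOnly));
    }
}
```

The same character gets the same two bytes in both charsets, which is what "upward compatible" means here.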

GB18030-2000 (GBK2K) further extends the Chinese characters on the basis of GBK and adds the scripts of ethnic minorities such as Tibetan and Mongolian. GBK2K fundamentally solves the shortage of code positions and glyphs. It has several characteristics:

It does not define all glyphs; it only specifies the encoding ranges, which are to be filled in later.
The encoding is variable-length. The two-byte part is compatible with GBK; the four-byte part holds the extended glyphs and code positions, with the ranges first byte 0x81-0xfe, second byte 0x30-0x39, third byte 0x81-0xfe, fourth byte 0x30-0x39.
Its adoption is phased: the first requirement is to implement all glyphs that map completely to the Unicode 3.0 standard.
It is a national standard and is mandatory.
At present no operating system or software implements GBK2K support; that is work for Chinese localization now and in the future.
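
Current JDKs generally do include a "GB18030" charset, so the four-byte layout listed above can be observed directly. This is only a sketch under that assumption; the class and helper names are hypothetical. It encodes \u00d6, which has no two-byte GBK code, and checks each byte against the ranges given above.

```java
import java.nio.charset.Charset;

class Gb18030FourByte {
    static byte[] encode(String s) {
        return s.getBytes(Charset.forName("GB18030"));
    }

    static boolean inRange(byte b, int lo, int hi) {
        int v = b & 0xff;
        return v >= lo && v <= hi;
    }

    /** Checks the four-byte layout: 0x81-0xfe, 0x30-0x39, 0x81-0xfe, 0x30-0x39. */
    static boolean isFourByteForm(byte[] b) {
        return b.length == 4
                && inRange(b[0], 0x81, 0xfe) && inRange(b[1], 0x30, 0x39)
                && inRange(b[2], 0x81, 0xfe) && inRange(b[3], 0x30, 0x39);
    }

    public static void main(String[] args) {
        // A GBK character keeps its two-byte code in GB18030.
        System.out.println("two-byte length: " + encode("\u554a").length);
        // A character outside GBK falls into the four-byte area.
        System.out.println("four-byte form:  " + isFourByteForm(encode("\u00d6")));
    }
}
```
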
An introduction to Unicode is skipped here.

The Java-supported encodings relevant to Chinese programming are listed below (several are not listed in the JDK documentation):
ASCII: 7-bit, same as ASCII7
ISO8859-1: 8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, latin1 ...
GB2312-80: same as gb2312, GB2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB ...
GBK (note the case): same as MS936
UTF8: same as UTF-8
GB18030 (now supported only by IBM JDK1.3.?): same as Cp1392, 1392
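
Which of these names a particular JVM accepts, and what canonical name each resolves to, can be queried through java.nio.charset.Charset. The sketch below is illustrative only: the class name is made up, and the alias sets differ between JDK vendors and versions, so names from the list above may be reported as unsupported on a given JVM.

```java
import java.nio.charset.Charset;

class EncodingNames {
    /** Resolves an encoding name or alias to the JVM's canonical charset name. */
    static String canonical(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        String[] names = {"8859_1", "latin1", "GB2312", "EUC_CN", "GBK", "UTF8", "GB18030"};
        for (String name : names) {
            // isSupported avoids an exception for names this JVM does not know.
            if (Charset.isSupported(name)) {
                System.out.println(name + " -> " + canonical(name));
            } else {
                System.out.println(name + " is not supported by this JVM");
            }
        }
    }
}
```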

The Java language processes characters in Unicode. Viewed another way, however, a Java program can also use a non-Unicode conversion; what matters is that Chinese character data is not distorted at the program's entry and exit points. Handling Chinese characters as ISO-8859-1, for example, can also produce correct results, and many of the solutions circulating on the web are of this kind. To avoid confusion, this article does not discuss that method.

3. The origin of '?' and garbled characters in Chinese transcoding

Conversions in either direction can produce wrong results:

Unicode-->byte: if the target code set has no corresponding code, the result is 0x3f.
For example:
"\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK") yields "?ìéF?ù", whose hex value is 3fa8aca8a6463fa8b4.
Look closely at the result above: \u00ec is converted to 0xa8ac, \u00e9 to 0xa8a6, and so on. The number of significant bits actually grows! This is because some symbols in the GB2312 symbol zones are mapped to common symbol code points that already exist in ISO-8859-1 or other SBCS character sets. Those symbols therefore sit early in Unicode, some with no more than 8 significant bits, and overlap with the byte values used to encode Chinese characters. (In fact this is only a mapping of the encoding; on close inspection the displayed forms differ. The symbols in Unicode are single-byte wide, while those in the Chinese character set are double-byte wide.) There are 20 such symbols between \u00a0 and \u00ff. Understanding this is important! It explains why a Chinese encoding error in Java programming often shows up as garbled text (actually symbol characters) rather than as all '?' characters, as in the example above.
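
The example above can be reproduced with a few lines of Java. This is a minimal sketch, assuming the JDK provides the "GBK" charset; the class and method names are hypothetical. It prints the encoded bytes in hex so that the 0x3f substitutions and the two-byte symbol codes are both visible.

```java
import java.nio.charset.Charset;

class UnicodeToGbk {
    /** Encodes the string as GBK and returns the bytes as lowercase hex. */
    static String gbkHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName("GBK"))) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two of the six characters have no GBK code and are replaced by 0x3f;
        // three are symbols that GBK encodes with two bytes each.
        System.out.println(gbkHex("\u00d6\u00ec\u00e9\u0046\u00bb\u00f9"));
    }
}
```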

Byte-->Unicode: if the character identified by the bytes does not exist in the source code set, the result is 0xfffd.
For example:
byte ba[] = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1}; new String(ba, "gb2312");
The result is "?啊", and the hex value is "\ufffd\u554a". 0x8140 is a GBK character that has no entry in the GB2312 conversion table, so \ufffd is used. (Note: when this Unicode character is displayed, it has no corresponding local character, so, as in the previous case, it is shown as "?".)
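
A runnable version of this example is sketched below, with hypothetical class and method names and the assumption that the "GB2312" and "GBK" charsets are available. The exact number of characters produced for the unmappable bytes may differ between JDK versions (a newer decoder may report 0x81 as malformed on its own and then decode 0x40 as a separate "@"), so the sketch only looks at the first and last characters for GB2312 and compares with a GBK decode of the same bytes.

```java
import java.nio.charset.Charset;

class Gb2312Decode {
    // 0x8140 is a GBK-only code; 0xb0a1 is valid in both GB2312 and GBK.
    static final byte[] SAMPLE = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1};

    static String decode(String charsetName) {
        return new String(SAMPLE, Charset.forName(charsetName));
    }

    /** Renders each char as four hex digits so replacement characters are visible. */
    static String codes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%04x ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("as GB2312: " + codes(decode("GB2312")));
        System.out.println("as GBK:    " + codes(decode("GBK")));
    }
}
```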

In actual programming, the wrong Chinese character data that a JSP/Servlet program obtains is often the superposition of these two processes, and sometimes even the result of superposing them and then repeating the conversion.
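
One way such a superposition can arise is sketched below: GBK bytes are first decoded with the wrong charset (ISO-8859-1 here, chosen as an assumption for illustration) and the resulting string is then encoded as GBK again. The class and method names are hypothetical. The single Chinese character ends up as a symbol followed by 0x3f, which matches the mixed "symbols and question marks" pattern described in this section.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DoubleConversion {
    static final Charset GBK = Charset.forName("GBK");

    /** Step 1: GBK bytes are wrongly decoded byte by byte as ISO-8859-1. */
    static String misread(String original) {
        return new String(original.getBytes(GBK), StandardCharsets.ISO_8859_1);
    }

    /** Step 2: the wrong string is encoded as GBK for output. */
    static String reencodedHex(String wrong) {
        StringBuilder sb = new StringBuilder();
        for (byte b : wrong.getBytes(GBK)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String wrong = misread("\u554a");
        // The two bytes 0xb0 0xa1 have become two separate Latin-1 characters.
        System.out.println("chars after misread: " + wrong.length());
        // The second of them has no GBK code, so the output ends in 0x3f.
        System.out.println("bytes sent out:      " + reencodedHex(wrong));
    }
}
```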

4. JSP/Servlet Chinese character encoding problems and their solutions in WAS

4.1 Common symptoms of encoding problems
The JSP/Servlet encoding problems commonly reported on the Internet generally show up in the browser or on the application side, for example:
Why do the Chinese characters in a JSP/Servlet page become '?' in the browser?
Why are the Chinese characters in a Servlet page garbled in the browser?
Why do Chinese characters in a Java application's interface become squares?
A JSP/Servlet page cannot display GBK Chinese characters.
The Chinese characters in Java code embedded in JSP tags such as <%...%> and <%=...%> are garbled, while the other Chinese characters on the page are correct.
A JSP/Servlet cannot receive the Chinese characters submitted by a form.
A JSP/Servlet cannot read or write the correct content from or to a database.
Behind these problems lie various wrong character conversions and processing steps (except for the third, which is caused by a wrong Java font setting). To solve such character encoding problems, you need to understand how a JSP/Servlet runs and examine each point where a problem may arise.

4.2 Encoding issues in JSP/Servlet web programming
A JSP/Servlet running on a Java application server provides HTML content to the browser.

Character encoding conversions take place at the following points:

JSP compilation. The Java application server reads the JSP source file according to the JVM's file.encoding value, compiles it into a Java source file, and writes that file back to the file system, again according to file.encoding. If the current system language supports GBK, no encoding problem arises at this point. On an English system, such as Linux, AIX or Solaris with LANG set to en_US, set the JVM's file.encoding value to GBK. If the system language is GB2312, decide as needed whether to set file.encoding; setting it to GBK resolves potential garbling of GBK characters.

The Java source must be compiled into a .class file to execute in the JVM, and this step has the same file.encoding problem as JSP compilation. From here on, servlets and JSPs run in much the same way, except that a servlet is not compiled automatically. For a JSP, the generated intermediate Java file is compiled automatically (the program calls the sun.tools.javac.Main class directly). So if a problem appears at this step, check the encoding and the OS locale, or either convert the static Chinese characters embedded in the JSP's Java code to Unicode or keep static text output out of the Java code. For servlets, it is enough to specify the -encoding parameter manually when compiling with javac.
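
Converting static Chinese characters to Unicode, as suggested above, means writing them as backslash-u escapes so that the source file contains only ASCII and no longer depends on file.encoding. The helper below is a hypothetical sketch of that conversion (similar in spirit to what the JDK's native2ascii tool did); the class and method names are made up.

```java
class UnicodeEscaper {
    /** Replaces every non-ASCII char with its backslash-u escape (four hex digits). */
    static String toEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c); // plain ASCII is kept as-is
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input mixes ASCII with one Chinese character.
        System.out.println(toEscapes("aa\u554aaa"));
    }
}
```

The escaped form can be pasted into Java or JSP source and compiles to the same string regardless of the compiler's encoding setting.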

The Servlet must convert the HTML page content into an encoding the browser accepts before sending it. Depending on how each Java application server is implemented, some query the browser's Accept-Charset and Accept-Language parameters, and others determine the encoding by some other guess. Using a fixed encoding is therefore probably the best solution. For Chinese web pages, set contentType="text/html;charset=GB2312" in the JSP or Servlet, or contentType="text/html;charset=GBK" if the page contains GBK characters. Because IE and Netscape support GBK to different degrees, this setting needs to be tested.
Because the high 8 bits of a 16-bit Java char are discarded when it is sent over the network, and to ensure that the Chinese characters in a Servlet page (both those embedded in it and those produced at run time) arrive in the expected internal code, use PrintWriter out = res.getWriter() instead of ServletOutputStream out = res.getOutputStream(). The PrintWriter converts according to the charset specified in the ContentType (the ContentType must be set before this!). Alternatively, wrap the ServletOutputStream in an OutputStreamWriter and output Chinese strings with write(String).
For JSP, the Java application server should ensure that embedded Chinese characters are transmitted correctly at this stage.
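
The OutputStreamWriter approach can be tried outside a servlet container. In the sketch below a ByteArrayOutputStream stands in for the ServletOutputStream (an assumption made so the example runs on its own), and the class and method names are hypothetical. It shows that the writer's charset decides which bytes actually go out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class CharsetWriterDemo {
    /** Writes text through an OutputStreamWriter and returns the bytes as hex. */
    static String writtenHex(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the servlet stream
        try (Writer out = new OutputStreamWriter(sink, Charset.forName(charsetName))) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // With GBK the Chinese character is sent as its two-byte code.
        System.out.println("GBK:        " + writtenHex("\u554a", "GBK"));
        // With an 8-bit charset the character cannot be represented.
        System.out.println("ISO-8859-1: " + writtenHex("\u554a", "ISO-8859-1"));
    }
}
```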

This step explains the URL character encoding problem. If a parameter value sent back from the browser by GET or POST contains Chinese characters, the servlet cannot obtain the correct value. In Sun's J2SDK, HttpUtils.parseName ignores the browser's language setting entirely when parsing parameters and simply treats the value byte by byte. This is the most discussed encoding problem on the Internet. Because it is a design flaw, the only options are to re-parse the resulting string in binary mode or to hack the HttpUtils class. The referenced article 2 covers these approaches, but it is best to change the Chinese encodings GB2312 and CP1381 used there to GBK; otherwise GBK Chinese characters will still cause problems.
Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, which specifies the encoding the application expects before Request.getParameter("param_name") is called. This will help solve the problem completely.
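
Re-parsing "in binary mode" usually means recovering the original bytes from the wrongly decoded string and decoding them again with the intended charset. The sketch below simulates this without a servlet container; it assumes the container decoded the request bytes as ISO-8859-1 (one char per byte), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParamRedecode {
    /** Simulates a container that turns each raw request byte into one char. */
    static String decodedByContainer(String original, Charset browserCharset) {
        byte[] raw = original.getBytes(browserCharset);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the raw bytes and decodes them with the intended charset. */
    static String redecode(String wrong, Charset intended) {
        return new String(wrong.getBytes(StandardCharsets.ISO_8859_1), intended);
    }

    /** Runs both steps; the result should equal the original string. */
    static String roundTrip(String original, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return redecode(decodedByContainer(original, cs), cs);
    }

    public static void main(String[] args) {
        String original = "\u554a\u4e02"; // one GB2312 character and one GBK-only character
        System.out.println("recovered with GBK: " + roundTrip(original, "GBK").equals(original));
    }
}
```

Using GBK rather than GB2312 as the intended charset follows the advice above: with GB2312 the GBK-only character would not survive the second decode.
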
4.3 Workarounds in IBM WebSphere Application Server

WebSphere Application Server (WAS) extends the standard Servlet API 2.x to provide better multilingual support. On a Chinese operating system it handles Chinese characters well without any configuration. The following instructions apply only when WAS runs on an English system or when GBK support is required.

In the last two cases above (c and d: servlet output and request parameters), WAS queries the browser's language setting. By default, zh, zh-CN and similar values are mapped to the Java encoding Cp1381 (note: Cp1381 is only a codepage equivalent to GB2312 and has no GBK support). Presumably this is because there is no way to tell whether the operating system the browser runs on supports GB2312 or GBK, so the smaller set is chosen. Real applications, however, still need GBK Chinese characters in their pages; the best-known example is the character "镕" (rong2, 0xe946, \u9555). So it is sometimes necessary to specify the encoding/charset as GBK. Changing the default encoding in WAS is not as troublesome as described above. For the first two cases (a and b: JSP and Java compilation), see reference article 5 and specify -Dfile.encoding=GBK in the command-line arguments of the Application Server. For case d, specify -Ddefault.client.encoding=GBK in the command-line arguments of the Application Server. If -Ddefault.client.encoding=GBK is specified, the charset no longer needs to be specified in case c.

One of the problems listed above, that static text in Java code inside the <%...%> and <%=...%> tags is not displayed correctly, has a further workaround in WAS: in addition to setting the correct file.encoding, also set -Duser.language=zh -Duser.region=CN in the same way. This relates to the Java locale settings.
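
Whether such command-line settings actually reached the JVM can be checked from inside the application. The sketch below is a generic diagnostic, not something specific to WAS; the class and method names are made up. It reports the file.encoding property, the default charset the JVM is using, and the default locale.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class JvmEncodingInfo {
    /** Summarizes the encoding-related settings of the running JVM. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name()
                + ", locale=" + Locale.getDefault();
    }

    public static void main(String[] args) {
        // Run with the -D options under test and compare the output.
        System.out.println(describe());
    }
}
```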

4.4 Encoding problems when reading and writing databases

Another place where encoding problems often occur in JSP/Servlet programming is reading and writing data in a database.

Popular relational database systems support a database encoding: the character set can be specified when the database is created, and the data is stored in that encoding. When an application accesses the data, an encoding conversion takes place on both the way in and the way out. For Chinese data, the database character encoding must be set so that data integrity is preserved. GB2312, GBK and UTF-8 are all possible database encodings. ISO8859-1 (8-bit) can also be chosen, but then the application must split each Chinese character, or 16-bit Unicode character, into two 8-bit characters before writing, and after reading must combine the two bytes again while also distinguishing SBCS characters. This fails to make full use of the database encoding and adds programming complexity, so ISO8859-1 is not a recommended database encoding. In JSP/Servlet programming, first use the management tools provided by the database management system to check whether the Chinese data is correct.

Then pay attention to the encoding of the data after it is read: in a Java program it is generally Unicode. The reverse applies when writing data.

4.5 Common techniques for locating problems

The usual way to locate a Chinese encoding problem is the stupidest yet most effective one: print the internal code of the string after each processing step you suspect. By printing the internal code of the string, you can find out when Chinese characters are converted to Unicode, when Unicode is converted back to the Chinese internal code, when one Chinese character becomes two Unicode characters, when a Chinese string turns into a string of question marks, and when the high bits of a Chinese string are truncated ...

Choosing a suitable sample string also helps to distinguish the type of problem, for example "aa啊aa丂aa", a string that mixes English with GB and GBK characters. In general, English characters are not distorted however they are converted or processed (if they are, try increasing the number of consecutive English letters).
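
A small helper for printing the internal code of a string might look like the sketch below. The class and method names are hypothetical. Each char is printed as four hex digits, so a correct Chinese character, a pair of 8-bit characters, 003f and fffd are all easy to tell apart.

```java
class CodeDump {
    /** Returns the UTF-16 code units of the string as space-separated hex. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample string: English letters, one GB2312 character, one GBK-only character.
        String sample = "aa\u554aaa\u4e02aa";
        System.out.println(dump(sample));
    }
}
```

Calling dump on the same string before and after each suspect step shows at which step the codes change.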

5. Concluding remarks

In fact, JSP/Servlet Chinese encoding is not as complex as it seems. Although there are no fixed rules for locating and solving the problems, and operating environments vary, the underlying principle is the same. Knowledge of character sets is the basis for solving character problems. However, as the Chinese character sets change, such problems will persist in applications for some time, and not only in Java programming.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
