Question about Chinese character encoding in JSP

Last Update:2013-10-28 Source: Internet

Author: User

Tags websphere application server


The Internet already has many excellent articles and discussions about DBCS character encoding in JSP/Servlet. This article organizes them and applies them to IBM WebSphere Application Server 3.5 (WAS), in the hope that it is not redundant.

Content:

Origin

GB2312-80, GBK, GB18030-2000 Chinese Character Set and Encoding

Sources of '?' and garbled characters during Chinese transcoding

JSP/Servlet Chinese character encoding and solutions in WAS

Conclusion

References

1. Origin of the problem

Each country or region defines its own coded character sets for computer information interchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. As the basis of information processing in that country or region, such a standard plays an important role in unifying encodings. By code length, character sets are divided into SBCS (single-byte character sets) and DBCS (double-byte character sets). To handle local character data, early software (operating systems in particular) shipped in various localized versions (L10N), and concepts such as LANG and codepage were introduced to tell them apart. However, the code ranges of the local character sets overlap, which makes it difficult to exchange information between them, and maintaining each localized version separately is expensive. It therefore became necessary to factor out what the localization efforts have in common and handle it uniformly, keeping locale-specific processing to a minimum. This is known as internationalization (I18N). The various kinds of language information were further standardized as locale information, and the underlying character set became Unicode, which covers almost all glyphs.

Today the core character handling of most internationalized software is based on Unicode. At run time, the software determines the local character encoding from the current Locale/LANG/codepage setting and handles local characters accordingly. Along the way, text must be converted between Unicode and a local character set, or between two different local character sets with Unicode as the pivot. The same approach extends to networked environments: character data at either end of a connection must be converted into an acceptable form according to the character set settings.
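The conversion between two local character sets with Unicode as the pivot can be sketched in Java as follows. This is only an illustration: the class name PivotTranscoder and the method transcode are hypothetical, and the sketch uses the java.nio.charset API of later JDKs rather than anything specific to WAS 3.5.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for illustration only.
final class PivotTranscoder {

    // Converts bytes from one character set to another. The Java String in
    // the middle is the Unicode pivot representation.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String unicode = new String(input, from); // local bytes -> Unicode
        return unicode.getBytes(to);              // Unicode -> target bytes
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // U+4E2D U+6587 (the word "Chinese"), written with escapes so the
        // source file itself stays pure ASCII.
        byte[] gbkBytes = "\u4e2d\u6587".getBytes(gbk);
        byte[] utf8Bytes = transcode(gbkBytes, gbk, StandardCharsets.UTF_8);
        System.out.println(gbkBytes.length + " GBK bytes -> "
                + utf8Bytes.length + " UTF-8 bytes");
    }
}
```

The important point is that the caller must name both character sets explicitly. If the source character set is guessed wrongly, the pivot string is already damaged before the second step runs.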

The Java language represents characters internally in Unicode and complies with Unicode V2.0. Whether a Java program reads or writes files as character streams, writes HTML to a URL connection, or reads parameter values from a URL connection, character encodings are converted. Although this adds programming complexity and can cause confusion, it is consistent with the idea of internationalization.
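A minimal sketch of this stream-level conversion is shown below, assuming in-memory byte arrays in place of real files or URL connections. The class StreamCharsetDemo and its methods are hypothetical names chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical demo class; the byte arrays stand in for a file or connection.
final class StreamCharsetDemo {

    // Unicode text -> bytes: the writer encodes with the given character set.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray(); // writer is closed and flushed by now
    }

    // Bytes -> Unicode text: the reader decodes with the given character set.
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Reading the same bytes back with a different character set than the one used for writing does not fail. It silently produces different characters, which is the typical origin of garbled text.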

In theory, converting characters according to the character set settings should not cause many problems. In practice, the run-time environments of applications vary widely, Unicode and the local character sets keep being supplemented and revised, and systems and applications are not implemented in a standardized way. As a result, transcoding problems continue to plague programmers and users.

2. GB2312-80, GBK, GB18030-2000 Chinese Character Set and Encoding

In fact, the fix for a Chinese character encoding problem in a Java program is often very simple. But to understand the reasons behind it and to locate the problem, you also need to understand the existing Chinese character sets and how encodings are converted.

GB2312-80 was drawn up in the early stage of Chinese information technology in China. It contains most of the commonly used level-1 and level-2 Chinese characters plus nine zones of symbols. It is the most basic Chinese character set and is supported by almost all Chinese systems and internationalized software. Its encoding range is 0xA1-0xFE for the high byte and 0xA1-0xFE for the low byte; Chinese characters start at 0xB0A1 and end at 0xF7FE.
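The byte ranges above translate directly into a range check. The sketch below is illustrative only: the class Gb2312Range and its methods are hypothetical, and the check tests the ranges quoted in this article, not whether a given code position is actually assigned.

```java
// Hypothetical helper that encodes the GB2312-80 ranges quoted in the text.
final class Gb2312Range {

    // True if the byte pair lies in the GB2312-80 double-byte area
    // (high byte 0xA1-0xFE, low byte 0xA1-0xFE).
    static boolean isDoubleByte(int hi, int lo) {
        return hi >= 0xA1 && hi <= 0xFE && lo >= 0xA1 && lo <= 0xFE;
    }

    // True if the byte pair lies in the Chinese character area
    // (0xB0A1 through 0xF7FE) rather than in the symbol zones.
    static boolean isHanzi(int hi, int lo) {
        int code = (hi << 8) | lo;
        return isDoubleByte(hi, lo) && code >= 0xB0A1 && code <= 0xF7FE;
    }
}
```

For example, the character 中 (U+4E2D) is encoded as 0xD6 0xD0 in GB2312-80, which falls inside the Chinese character area, while 0xA1 0xA1 lies in the symbol zones.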

GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20,902 Chinese characters, and its encoding range is 0x8140-0xFEFE, excluding code positions whose high byte is 0x80. All of its characters map one-to-one to Unicode 2.0, which means that Java in effect supports the GBK character set. GBK is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it, perhaps because its vendors do not fully understand what GBK is. Note that GBK is not a national standard, only a specification. With the release of the national standard GB18030-2000, GBK will complete its historical mission in the near future.
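One way to see the difference in coverage between GB2312-80 and GBK is to ask each encoder whether it can represent a given character. The following is a small sketch using java.nio.charset from later JDKs; the class name GbkCoverage is hypothetical, and it assumes the JDK in use ships the GB2312 and GBK charsets.

```java
import java.nio.charset.Charset;

// Hypothetical helper for comparing character set coverage.
final class GbkCoverage {

    // True if the named character set has a code for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char simplified = '\u6c49';  // U+6C49, simplified form of "Han"
        char traditional = '\u6f22'; // U+6F22, traditional form of "Han"
        for (String name : new String[] {"GB2312", "GBK"}) {
            System.out.println(name + ": simplified=" + canEncode(name, simplified)
                    + ", traditional=" + canEncode(name, traditional));
        }
    }
}
```

The traditional form 漢 (U+6F22) is not part of GB2312-80 but is covered by GBK, so text containing it survives a GBK round trip and is damaged by a GB2312 one.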

Building on GBK, GB18030-2000 (GBK2K) further extends the set of Chinese characters and adds the scripts of ethnic minorities such as Tibetan and Mongolian. GBK2K fundamentally solves the problem of too few code positions and too few glyphs. It has several features:

It does not define all the glyphs; it only specifies the encoding range, which is to be extended later.

The encoding is variable-length. The two-byte part is compatible with GBK. The four-byte part holds the extended glyphs and code positions, with an encoding range of 0x81-0xFE for the first byte, 0x30-0x39 for the second byte, 0x81-0xFE for the third byte, and 0x30-0x39 for the fourth byte.

It is being promoted in stages. The first requirement is full support for all glyphs that can be mapped to the Unicode 3.0 standard.

It is a national standard and mandatory.
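The variable-length layout described in the list above can be checked against an encoder. The sketch below assumes a later JDK that ships a GB18030 charset (not the case for the JDKs of the WAS 3.5 era); the class Gb18030Form and its methods are hypothetical names.

```java
import java.nio.charset.Charset;

// Hypothetical helper that tests the four-byte ranges quoted in the text.
final class Gb18030Form {

    private static boolean inRange(byte b, int lo, int hi) {
        int v = b & 0xFF; // treat the byte as unsigned
        return v >= lo && v <= hi;
    }

    // True if the bytes match the four-byte form: 0x81-0xFE, 0x30-0x39,
    // 0x81-0xFE, 0x30-0x39.
    static boolean isFourByteForm(byte[] b) {
        return b.length == 4
                && inRange(b[0], 0x81, 0xFE)
                && inRange(b[1], 0x30, 0x39)
                && inRange(b[2], 0x81, 0xFE)
                && inRange(b[3], 0x30, 0x39);
    }

    // Encodes text with the JDK's GB18030 charset (assumed to be available).
    static byte[] encode(String text) {
        return text.getBytes(Charset.forName("GB18030"));
    }
}
```

A character that GBK already covers, such as 中 (U+4E2D), still takes two bytes, while a Tibetan letter such as U+0F40 lands in the four-byte area.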

At present, no operating system or software supports GBK2K. That is work for current and future localization.

Unicode introduction.

The encodings supported by Java that are relevant to Chinese programming are listed below (several of them are not listed in the JDK documentation):

ASCII 7-bit, same as ascii7
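Because the set of supported encodings depends on the JDK in use, it can help to probe the running JVM directly. The sketch below relies on java.nio.charset.Charset, which only exists in JDKs later than the ones this article was written for; the class ChineseCharsetSupport and the method probe are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that reports which encoding names this JVM accepts.
final class ChineseCharsetSupport {

    // Maps each requested name to whether the running JVM supports it.
    static Map<String, Boolean> probe(String... names) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : names) {
            result.put(name, Charset.isSupported(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // The result depends on the JDK build, so no output is shown here.
        System.out.println(probe("US-ASCII", "GB2312", "GBK", "GB18030", "UTF-8"));
    }
}
```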
