Encoding problems in JSP/Servlet

Source: Internet
Author: User

There are many excellent articles and discussions on the internet about DBCS character encoding in JSP/Servlet. This article collates them and adds some explanations based on solutions for IBM WebSphere Application Server 3.5 (WAS), in the hope that it is not redundant.

Contents:

The origins of the problem

GB2312-80, GBK, and GB18030-2000 character sets and encodings

The origin of '?' and garbled text in Chinese transcoding

JSP/Servlet encoding problems and their solutions in WAS

Conclusion

Reference articles

1. The origins of the problem

Each country (or region) prescribes a set of character encodings for computer information interchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. These serve as the basis of information processing in that country or region and play the important role of unifying encoding. By length, character encoding sets fall into two broad categories: SBCS (single-byte character set) and DBCS (double-byte character set). Early software (especially operating systems) appeared in various localized versions (L10N) in order to process local character data, and concepts such as LANG and codepage were introduced to tell them apart. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining each localized version independently is costly. It is therefore necessary to extract what the localization work has in common and handle it uniformly, keeping locale-specific processing to a minimum. This is called internationalization (I18N). The various kinds of language information are further normalized into Locale information, and the underlying character set being processed becomes Unicode, which contains almost all glyphs.
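The overlap of local code ranges can be seen directly in Java. The following is a minimal sketch, not part of the original article: the class name `OverlappingCodeRanges` and the helper `decode` are hypothetical, and it assumes the runtime ships the `GBK` and `Shift_JIS` charsets. It decodes the same two bytes under a Chinese and a Japanese character set.

```java
import java.nio.charset.Charset;

public class OverlappingCodeRanges {
    /** Decodes the same raw bytes under the named local character set. */
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0xD6 0xD0 is the GB2312/GBK code for the character U+4E2D.
        byte[] raw = {(byte) 0xD6, (byte) 0xD0};
        String asGbk = decode(raw, "GBK");
        String asShiftJis = decode(raw, "Shift_JIS");
        System.out.println("GBK decodes to U+4E2D: " + asGbk.equals("\u4e2d"));
        System.out.println("Shift_JIS yields the same text: " + asGbk.equals(asShiftJis));
    }
}
```

The bytes carry no indication of which character set produced them, which is why the receiver has to be told the encoding out of band.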

Most software with internationalization features now bases its core character processing on Unicode. The local character encoding is determined from the Locale/LANG/codepage setting at run time, and local characters are processed accordingly. During processing, text must be converted between Unicode and the local character set, or even between two different local character sets by way of Unicode. In a networked environment this approach extends further: character data at both ends of the network must be converted into acceptable content according to the configured character set.
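Converting between two local character sets by way of Unicode might look like the sketch below. It is an illustration only: `TranscodeViaUnicode` and `transcode` are hypothetical names, and the choice of `GBK` and `Big5` as the two local sets is an assumption made for the example (both must be available in the runtime).

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class TranscodeViaUnicode {
    /** Converts bytes from one local character set to another, using a Java String (Unicode) as the pivot. */
    public static byte[] transcode(byte[] input, String fromCharset, String toCharset) {
        String unicode = new String(input, Charset.forName(fromCharset)); // local bytes -> Unicode
        return unicode.getBytes(Charset.forName(toCharset));              // Unicode -> target bytes
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        byte[] gbk = text.getBytes(Charset.forName("GBK"));
        byte[] big5 = transcode(gbk, "GBK", "Big5");
        byte[] back = transcode(big5, "Big5", "GBK");
        System.out.println("Same bytes in both sets: " + Arrays.equals(gbk, big5));
        System.out.println("Round trip preserved: " + Arrays.equals(gbk, back));
    }
}
```

The round trip is only lossless for characters that exist in both character sets; anything missing from the target set is replaced during encoding.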

Internally, the Java language represents characters in Unicode, following Unicode V2.0. A Java program performs character encoding conversion whenever it reads or writes files in the file system, writes HTML to a URL connection, or reads parameter values from a URL connection. This adds to the complexity of programming and easily leads to confusion, but it is in line with the idea of internationalization.
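The conversion at the file-system boundary can be made explicit by naming the charset on the reader and writer. The sketch below is an assumption-laden illustration (the class `ExplicitCharsetIo`, the method `roundTrip`, and the temporary file are all invented for the example): it writes a string with one charset and reads it back with another.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetIo {
    /** Writes text to a temporary file in one charset and reads it back in another. */
    public static String roundTrip(String text, String writeCharset, String readCharset) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                // Unicode -> bytes happens inside the writer.
                try (Writer w = new OutputStreamWriter(Files.newOutputStream(file),
                        Charset.forName(writeCharset))) {
                    w.write(text);
                }
                // bytes -> Unicode happens inside the reader.
                StringBuilder sb = new StringBuilder();
                try (Reader r = new InputStreamReader(Files.newInputStream(file),
                        Charset.forName(readCharset))) {
                    int c;
                    while ((c = r.read()) != -1) {
                        sb.append((char) c);
                    }
                }
                return sb.toString();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        System.out.println("Matching charsets: " + text.equals(roundTrip(text, "GBK", "GBK")));
        System.out.println("Mismatched charsets: " + text.equals(roundTrip(text, "GBK", "ISO-8859-1")));
    }
}
```

When the two charsets match, the text survives; when they do not, the reader reinterprets the bytes and the original characters are lost from the program's point of view.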

In theory, these conversions driven by character set settings should not cause many problems. In practice, because of the actual operating environment of the application, the ongoing additions and refinements to Unicode and the local character sets, and nonstandard system or application implementations, transcoding problems constantly trouble programmers and users.

2. GB2312-80, GBK, and GB18030-2000 character sets and encodings

In fact, the solution to an encoding problem in a Java program is often very simple, but understanding the reasons behind it and locating the problem requires knowledge of the existing Chinese character encodings and of encoding conversion.

GB2312-80 was developed in the early stage of Chinese computer information technology. It contains most of the commonly used level-1 and level-2 Chinese characters, plus nine areas of symbols. It is the Chinese character set supported by almost all Chinese systems and internationalized software, and it is also the most basic Chinese character set. Its coding range is 0xa1-0xfe for the high byte and 0xa1-0xfe for the low byte; Chinese characters start at 0xb0a1 and end at 0xf7fe.
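The ranges just given translate into a simple byte check. This is a sketch under the ranges stated above only; the class and method names are hypothetical, and it tests whether a code falls inside the range, not whether a character is actually assigned at that position.

```java
public class Gb2312Range {
    /** True if both bytes fall in the GB2312-80 coding range 0xA1-0xFE. */
    public static boolean isGb2312Code(int high, int low) {
        return high >= 0xA1 && high <= 0xFE && low >= 0xA1 && low <= 0xFE;
    }

    /** True if the two-byte code lies in the Chinese character span 0xB0A1-0xF7FE. */
    public static boolean isGb2312Hanzi(int high, int low) {
        int code = (high << 8) | low;
        return isGb2312Code(high, low) && code >= 0xB0A1 && code <= 0xF7FE;
    }

    public static void main(String[] args) {
        System.out.println(isGb2312Hanzi(0xD6, 0xD0)); // a code inside the Chinese character span
        System.out.println(isGb2312Hanzi(0xA1, 0xA1)); // a code in the symbol areas
        System.out.println(isGb2312Code(0x81, 0x40));  // a code outside GB2312-80
    }
}
```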

GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20,902 Chinese characters, and its coding range is 0x8140-0xfefe, excluding code positions with a high byte of 0x80. All of its characters can be mapped one-to-one to Unicode 2.0, which means that Java actually provides support for the GBK character set. It is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it, and such software does not seem to fully understand GBK. It is worth noting that GBK is not a national standard, only a specification. With the release of the national standard GB18030-2000, it will complete its historical mission in the near future.
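One way to observe that GBK extends GB2312-80 is to ask each encoder what it can encode. The sketch below is illustrative: `GbkSuperset`, `canEncode`, and `countEncodable` are hypothetical names, the probe character U+9555 is chosen for the example as a character present in GBK but absent from GB2312-80, and the result depends on the `GBK` and `GB2312` charsets shipped with the runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class GbkSuperset {
    /** True if the named charset has an encoding for the character. */
    public static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    /** Counts how many characters in [first, last] the named charset can encode. */
    public static int countEncodable(String charsetName, char first, char last) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int count = 0;
        for (int cp = first; cp <= last; cp++) {
            if (encoder.canEncode((char) cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        char probe = '\u9555';
        System.out.println("GB2312 can encode probe: " + canEncode("GB2312", probe));
        System.out.println("GBK can encode probe: " + canEncode("GBK", probe));
        // Count over the Unicode 2.0 CJK block U+4E00..U+9FA5, for comparison
        // with the 20,902 figure quoted in the text.
        System.out.println("GBK count: " + countEncodable("GBK", '\u4e00', '\u9fa5'));
        System.out.println("GB2312 count: " + countEncodable("GB2312", '\u4e00', '\u9fa5'));
    }
}
```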

GB18030-2000 (GBK2K) further expands the set of Chinese characters on the basis of GBK and adds the scripts of minority languages such as Tibetan and Mongolian. GBK2K fundamentally solves the problem of insufficient code positions and missing glyphs. It has several characteristics:

It does not define all the glyphs; it only specifies the coding range and leaves the rest to later expansion.

The encoding is variable length. The two-byte part is compatible with GBK; the four-byte part holds the extended glyphs and code positions, with a coding range of 0x81-0xfe for the first byte, 0x30-0x39 for the second byte, 0x81-0xfe for the third byte, and 0x30-0x39 for the fourth byte.

Its rollout is staged; the first requirement is to implement all glyphs that can be fully mapped to the Unicode 3.0 standard.

It is a national standard and is mandatory.

No operating system or software supports GBK2K yet; this is work for the present and the future.
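The byte ranges above are enough to tell how long a GBK2K sequence is from its first two bytes. The following sketch is not a decoder and the names are hypothetical. Two details are assumptions taken from the GB18030 standard and not from this article: bytes up to 0x7F are treated as single-byte, and 0x7F is excluded as a second byte of the two-byte form.

```java
public class Gb18030Length {
    /**
     * Returns 1, 2, or 4 for the length of the sequence starting at offset,
     * or -1 if the bytes do not fit any of the ranges or are truncated.
     */
    public static int sequenceLength(byte[] bytes, int offset) {
        int b1 = bytes[offset] & 0xFF;
        if (b1 <= 0x7F) {
            return 1; // assumption: single-byte ASCII range
        }
        if (b1 < 0x81 || b1 > 0xFE || offset + 1 >= bytes.length) {
            return -1;
        }
        int b2 = bytes[offset + 1] & 0xFF;
        if (b2 >= 0x40 && b2 <= 0xFE && b2 != 0x7F) {
            return 2; // GBK-compatible two-byte form
        }
        if (b2 >= 0x30 && b2 <= 0x39) {
            if (offset + 3 >= bytes.length) {
                return -1;
            }
            int b3 = bytes[offset + 2] & 0xFF;
            int b4 = bytes[offset + 3] & 0xFF;
            boolean valid = b3 >= 0x81 && b3 <= 0xFE && b4 >= 0x30 && b4 <= 0x39;
            return valid ? 4 : -1; // four-byte extended form
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(new byte[] {0x41}, 0));
        System.out.println(sequenceLength(new byte[] {(byte) 0xD6, (byte) 0xD0}, 0));
        System.out.println(sequenceLength(new byte[] {(byte) 0x81, 0x30, (byte) 0x81, 0x30}, 0));
    }
}
```

Because the second byte of a four-byte sequence (0x30-0x39) never overlaps the second byte of a two-byte sequence (0x40-0xfe), the two forms can be distinguished without any lookahead beyond that byte.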

An introduction to Unicode ... is omitted here.

Encodings supported by Java that are relevant to Chinese programming (several are not listed in the JDK documentation):

ASCII (7-bit): same as Ascii7

ISO8859-1 (8-bit): same as 8859_1, iso-8859-1, iso_8859-1, latin1, ...

GB2312-80: same as gb2312, gb2312-1980, euc_cn, euccn, 1381, cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB, ...

GBK (note the case): same as MS936

UTF8: same as UTF-8

GB18030 (currently supported only by IBM JDK1.3.?): same as cp1392, 1392
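Which of these names a given runtime accepts can be checked programmatically. The sketch below uses `java.nio.charset.Charset`, which was added in Java 1.4 and so is newer than the JDK versions this article discusses; the class `EncodingNames` and method `canonicalName` are hypothetical. Results vary by JDK vendor and version, so no particular output is assumed for the Chinese encodings.

```java
import java.nio.charset.Charset;

public class EncodingNames {
    /** Returns the canonical name the runtime resolves an encoding name to, or null if unsupported. */
    public static String canonicalName(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalArgumentException e) {
            // Thrown for names that are syntactically illegal as charset names.
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = {"ASCII", "latin1", "8859_1", "gb2312", "euc_cn", "GBK", "MS936", "UTF8", "GB18030"};
        for (String name : names) {
            System.out.println(name + " -> " + canonicalName(name));
        }
    }
}
```

Printing the resolved name for each alias shows which names the runtime treats as the same encoding and which it treats as distinct.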

Java processes characters as Unicode. Seen from another angle, however, a Java program can also work without transcoding to Unicode; what matters is that Chinese character data is not distorted on its way into and out of the program. For example, handling Chinese characters entirely as ISO-8859-1 can also produce correct results, and many of the solutions circulating on the web fall into this category. To avoid confusion, this article does not discuss that method.
