The Chinese Character Encoding Problem in JSP/Servlet

Source: Internet
Author: User
Tags websphere application server


1. The origin of the problem
Each country or region defines character encodings for computer information interchange, such as ASCII in the United States, GB2312-80 in China, and JIS in Japan. These serve as the basis of information processing in that country or region and play an important unifying role. By length, character encoding sets fall into two categories: SBCS (single-byte character sets) and DBCS (double-byte character sets). Early software, operating systems in particular, shipped in various localized versions (L10N) to handle local character data, and concepts such as LANG and codepage were introduced to tell them apart. However, because the code ranges of local character sets overlap, exchanging information between them is difficult, and maintaining each localized version independently is costly. It therefore became necessary to factor out what the localization work has in common and to keep locale-specific processing to a minimum. This is internationalization (I18N). Language-specific information was further standardized as locale information, and the underlying character set became Unicode, which covers almost all glyphs.
Most software with internationalization features now does its core character processing in Unicode. At run time, the locale/LANG/codepage settings determine the local character encoding, and local characters are handled accordingly. During processing, text has to be converted between Unicode and the local character set, or even between two different local character sets with Unicode as the intermediate form. The same approach extends to networked environments: character data at either end of a connection has to be converted, according to the character set settings, into content the receiving side can accept.
Internally, the Java language represents characters in Unicode and follows Unicode V2.0. Whenever a Java program reads or writes a file as a character stream, writes HTML to a URL connection, or reads parameter values from a URL connection, a character encoding conversion takes place. This adds programming complexity and invites confusion, but it is consistent with the idea of internationalization.
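The conversion that happens when a Java program reads or writes a file as a character stream can be sketched as follows. This is only an illustration: the class name CharStreamRoundTrip, the method roundTrip, and the temporary file name are assumptions made for this example, and it presumes the JVM ships the GBK charset.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class CharStreamRoundTrip {

    /** Writes text to a temp file in the named charset, reads it back, and returns the result. */
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                // Unicode chars -> bytes in the local character set
                try (Writer w = new OutputStreamWriter(Files.newOutputStream(file), cs)) {
                    w.write(text);
                }
                // bytes in the local character set -> Unicode chars
                StringBuilder sb = new StringBuilder();
                try (Reader r = new InputStreamReader(Files.newInputStream(file), cs)) {
                    for (int c = r.read(); c != -1; c = r.read()) {
                        sb.append((char) c);
                    }
                }
                return sb.toString();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, U+4E2D and U+6587
        System.out.println(text.equals(roundTrip(text, "GBK")));
    }
}
```

Naming the charset explicitly on both the writer and the reader keeps the result independent of the platform default, which is the setting that usually differs between machines.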
In theory, these conversions driven by character set settings should not cause many problems. In practice, because of the actual operating environment of applications, the ongoing additions and revisions to Unicode and to the individual local character sets, and nonconforming system or application implementations, transcoding problems keep troubling programmers and users.
2. The GB2312-80, GBK, and GB18030-2000 Chinese character sets
In fact, solving a Chinese character encoding problem in a Java program is often very simple, but understanding the reason behind it and locating the problem still requires knowledge of the existing Chinese character encodings and the conversions between them.
GB2312-80 was drawn up in the early days of Chinese computer character information technology. It contains most of the commonly used level-1 and level-2 Chinese characters, plus symbols in 9 zones. It is supported by almost all Chinese systems and internationalized software, and it is the most basic Chinese character set. Its code range is 0xA1-0xFE for the high byte and likewise 0xA1-0xFE for the low byte; Chinese characters start at 0xB0A1 and end at 0xF7FE.
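As a rough illustration of the GB2312-80 code ranges just described, the sketch below classifies a single character by its encoded bytes. The class Gb2312Range and its method names are made up for this example, and it assumes the JVM provides a charset named "GB2312".

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class Gb2312Range {

    /** True if both bytes fall in 0xA1-0xFE, the double-byte range of GB2312-80. */
    static boolean inCodeRange(int hi, int lo) {
        return hi >= 0xA1 && hi <= 0xFE && lo >= 0xA1 && lo <= 0xFE;
    }

    /** True if the byte pair lies in the Chinese character area 0xB0A1-0xF7FE. */
    static boolean isHanzi(int hi, int lo) {
        int code = (hi << 8) | lo;
        return inCodeRange(hi, lo) && code >= 0xB0A1 && code <= 0xF7FE;
    }

    /** Encodes a one-character string as GB2312 and names the area it falls in. */
    static String classify(String oneChar) {
        byte[] b = oneChar.getBytes(Charset.forName("GB2312"));
        if (b.length != 2) {
            return "not double-byte"; // ASCII, or a character GB2312 cannot encode
        }
        int hi = b[0] & 0xFF;
        int lo = b[1] & 0xFF;
        if (isHanzi(hi, lo)) {
            return "hanzi";
        }
        return inCodeRange(hi, lo) ? "symbol" : "outside GB2312";
    }

    public static void main(String[] args) {
        System.out.println(classify("\u554a")); // a common Chinese character
        System.out.println(classify("\u3001")); // ideographic comma
        System.out.println(classify("A"));
    }
}
```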
GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20,902 Chinese characters, and its code range is 0x8140-0xFEFE, excluding positions whose high byte is 0x80. All of its characters map one-to-one to Unicode 2.0, which means that Java in effect supports the GBK character set. GBK is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it, and some vendors do not seem to know quite what GBK is. Note that GBK is not a national standard, only a specification. With the release of the national standard GB18030-2000, it will complete its historical mission in the near future.
GB18030-2000 (GBK2K) further extends the Chinese character repertoire on the basis of GBK and adds the scripts of minority nationalities such as Tibetan and Mongolian. GBK2K fundamentally solves the shortage of code positions and glyphs. It has several features:
It does not define all the glyphs; it only specifies the code ranges and leaves them for later extension.
The encoding is variable-length. The two-byte part is compatible with GBK. The four-byte part holds the extended glyphs and code positions; its code range is 0x81-0xFE for the first byte, 0x30-0x39 for the second byte, 0x81-0xFE for the third byte, and 0x30-0x39 for the fourth byte.
Its rollout is phased; the first requirement is to implement all glyphs that map fully to the Unicode 3.0 standard.
It is a national standard and is mandatory.
At present no operating system or software supports GBK2K; this is work for Chinese localization now and in the future.
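The variable-length layout listed above can be sketched as a small function that works out how many bytes the sequence at a given offset occupies. The class Gb18030Length is hypothetical. The single-byte range 0x00-0x7F and the two-byte trail range 0x40-0xFE (excluding 0x7F) are taken from the GBK layout rather than from the text above, and the example assumes the JVM provides the GB18030 charset.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class Gb18030Length {

    /** Returns 1, 2, or 4 for a well-formed sequence at data[off], or -1 otherwise. */
    static int sequenceLength(byte[] data, int off) {
        int b1 = data[off] & 0xFF;
        if (b1 <= 0x7F) {
            return 1; // single-byte (ASCII) part
        }
        if (b1 < 0x81 || b1 > 0xFE || off + 1 >= data.length) {
            return -1;
        }
        int b2 = data[off + 1] & 0xFF;
        if (b2 >= 0x30 && b2 <= 0x39) {
            // four-byte part: 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39
            if (off + 3 >= data.length) {
                return -1;
            }
            int b3 = data[off + 2] & 0xFF;
            int b4 = data[off + 3] & 0xFF;
            boolean ok = b3 >= 0x81 && b3 <= 0xFE && b4 >= 0x30 && b4 <= 0x39;
            return ok ? 4 : -1;
        }
        // two-byte part, compatible with GBK
        return (b2 >= 0x40 && b2 <= 0xFE && b2 != 0x7F) ? 2 : -1;
    }

    /** Encodes s as GB18030 and returns the length of its first sequence. */
    static int firstSequenceLength(String s) {
        return sequenceLength(s.getBytes(Charset.forName("GB18030")), 0);
    }

    public static void main(String[] args) {
        System.out.println(firstSequenceLength("a"));
        System.out.println(firstSequenceLength("\u4e2d")); // a GBK Chinese character
        System.out.println(firstSequenceLength("\u0f40")); // a Tibetan letter
    }
}
```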
3. JSP/Servlet Chinese character encoding problems and their solutions in WAS
3.1 Common symptoms of encoding problems
The JSP/Servlet encoding problems commonly raised on the Internet generally show up on the browser or application side, for example:
Why do the Chinese characters in a JSP/Servlet page appear as '?' in the browser?
Why do the Chinese characters in a Servlet page appear garbled in the browser?
Why do the Chinese characters in a Java application's user interface appear as squares?
A JSP/Servlet page cannot display GBK Chinese characters.
A JSP/Servlet cannot receive the Chinese characters submitted from a form.
JSP/Servlet database reads and writes do not return the correct content.
Behind these problems lie various incorrect character conversions and handling (except the third, which is caused by a wrong Java font setting). To solve this kind of character encoding problem, you need to understand how a JSP/Servlet runs and check each point where a problem can arise.
3.2 Encoding issues in JSP/Servlet web programming
A JSP/Servlet running on a Java application server supplies HTML content to the browser. Character encoding conversions take place at the following points in that process:
a. The JSP is compiled. The Java application server reads the JSP source file according to the JVM's file.encoding value, converts it to the internal character encoding, compiles the JSP into a Java source file, and writes that file back to the file system, again according to file.encoding. If the current system language supports GBK, no encoding problem arises at this step. If the system is an English one, for example Linux, AIX, or Solaris with LANG set to en_US, set the JVM's file.encoding value to GBK. If the system language is GB2312, decide as needed whether to set file.encoding; setting it to GBK resolves potential garbling of GBK characters.
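The effect of a file.encoding value that does not match the encoding of the source file can be imitated in a few lines. This is a sketch, not what any particular application server does internally; SourceEncodingMismatch and writeThenRead are hypothetical names, and ISO-8859-1 merely stands in for the default encoding of an English-locale system.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class SourceEncodingMismatch {

    /** Encodes text the way the file was saved, then decodes it the way the reader is configured. */
    static String writeThenRead(String text, String writtenAs, String readAs) {
        byte[] onDisk = text.getBytes(Charset.forName(writtenAs));
        return new String(onDisk, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        // What this JVM would use when no encoding is given explicitly.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        String source = "\u4e2d\u6587"; // two Chinese characters
        // Matching settings: the text survives.
        System.out.println(source.equals(writeThenRead(source, "GBK", "GBK")));
        // Mismatched settings: each GBK byte becomes a separate character.
        System.out.println(writeThenRead(source, "GBK", "ISO-8859-1").length());
    }
}
```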

b. The Java source has to be compiled into a .class file before it can run in the JVM. This step has the same file.encoding issue as step a. From this point on, Servlets and JSPs run in the same way, except that Servlet compilation is not automatic.
c. The Servlet has to convert the HTML page content into an encoding the browser accepts before sending it out. Depending on the Java application server implementation, some servers query the browser's Accept-Charset and Accept-Language parameters, and others determine the encoding by some other guess. A fixed encoding is therefore probably the best solution. For Chinese web pages, set contentType="text/html; charset=GB2312" in the JSP or Servlet; if the page contains GBK characters, set contentType="text/html; charset=GBK". Because IE and Netscape differ in their support for GBK, this setting needs to be tested.
Because the high 8 bits of a 16-bit Java char are discarded during network transmission, and to make sure that the Chinese characters in the Servlet page (both those embedded in it and those produced while the Servlet runs) have the expected internal code, use PrintWriter out = res.getWriter() instead of ServletOutputStream out = res.getOutputStream(). The PrintWriter converts according to the charset specified in the content type (so the content type must be set before this call). Alternatively, wrap the ServletOutputStream in an OutputStreamWriter and output Chinese strings with write(String).
For JSPs, the Java application server should ensure that embedded Chinese characters are sent correctly at this stage.
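The difference between byte-oriented output that drops the high 8 bits and output through a charset-aware writer can be shown without a servlet container. In this sketch a ByteArrayOutputStream stands in for the servlet's output stream; the class ResponseWriterSketch and its methods are hypothetical and only imitate the two behaviours described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class ResponseWriterSketch {

    /** Imitates byte-oriented output that keeps only the low 8 bits of each char. */
    static byte[] lowByteOnly(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i); // the high 8 bits are lost here
        }
        return out;
    }

    /** Sends the string through an OutputStreamWriter that encodes in the named charset. */
    static byte[] viaWriter(String s, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the response stream
        try (Writer w = new OutputStreamWriter(sink, Charset.forName(charsetName))) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // one Chinese character, U+4E2D
        System.out.println(lowByteOnly(s).length + " byte(s) after truncation");
        System.out.println(viaWriter(s, "GBK").length + " byte(s) through the writer");
    }
}
```

In a real Servlet the same idea applies to the stream returned by res.getOutputStream(), with the charset matching the one declared in the content type.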
d. This is the URL character encoding issue. If the parameter values returned from the browser through the GET/POST methods contain Chinese characters, the Servlet will not get the correct value. In Sun's J2SDK, HttpUtils.parseName takes no account of the browser's language setting when parsing parameters; it parses the received value byte by byte. This is the most discussed encoding issue on the Internet. Because it is a design flaw, the only options are to re-parse the resulting string from its bytes or to hack the HttpUtils class. Reference articles 2) and 3) both describe this, but it is best to change the Chinese encodings GB2312 and CP1381 used there to GBK; otherwise problems remain when GBK Chinese characters are encountered.
Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, which specifies the encoding the application expects before request.getParameter("param_name") is called; this will help to resolve the problem completely.
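The "re-parse from bytes" workaround is commonly written as a one-line conversion. The sketch below assumes the container turned each received byte into one char, which is the same as decoding with ISO-8859-1; that assumption has to be checked against the actual server. ParamRedecode and its methods are hypothetical names, and simulateByteParsing only exists to reproduce the broken value without a servlet container.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class ParamRedecode {

    /** Reproduces what byte-wise parsing yields: one char per byte sent by the browser. */
    static String simulateByteParsing(String typedText, String browserCharset) {
        byte[] sent = typedText.getBytes(Charset.forName(browserCharset));
        return new String(sent, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes and decodes them with the charset the browser really used. */
    static String redecode(String rawParam, String actualCharset) {
        byte[] original = rawParam.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        String typed = "\u4e2d\u6587"; // what the user entered in the form
        String raw = simulateByteParsing(typed, "GBK"); // what getParameter would hand back
        System.out.println(raw.length()); // four chars instead of two
        System.out.println(typed.equals(redecode(raw, "GBK")));
    }
}
```

With Servlet API 2.3 or later, calling request.setCharacterEncoding("GBK") before the first getParameter call makes this manual step unnecessary.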
WebSphere Application Server extends the standard Servlet API 2.x to provide better multilingual support. In cases c and d above, WAS queries the browser's language setting; by default, zh, zh-cn, and similar values are mapped to the Java encoding CP1381 (note: CP1381 is a codepage equivalent to GB2312 only, with no GBK support). I think this is because there is no way to confirm whether the operating system the browser runs on supports GB2312 or GBK, so the smaller set is chosen. Real applications, however, still require GBK Chinese characters to appear in pages; the best-known example is the "镕" in Premier Zhu's name (rong2, GBK 0xE946, U+9555), so the encoding/charset sometimes has to be specified as GBK. Of course, changing the default encoding in WAS is not as troublesome as described above. For a and b, see reference article 5) and specify -Dfile.encoding=GBK in the application server's command-line arguments. For d, specify -Ddefault.client.encoding=GBK in the application server's command-line arguments. If -Ddefault.client.encoding=GBK is specified, the charset no longer needs to be given in case c.
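Whether a given character falls outside GB2312 but inside GBK can be checked directly with a charset encoder. The sketch below tests the character U+9555 cited above; it uses Java's "GB2312" charset as a stand-in for CP1381, which is an assumption based on the equivalence stated in the text, and GbkOnlyCharacter is a hypothetical class name.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class GbkOnlyCharacter {

    /** True if the named charset has a code for the character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    /** Lower-case hex of the character's bytes in the named charset. */
    static String hexBytes(String charsetName, char c) {
        StringBuilder sb = new StringBuilder();
        for (byte b : String.valueOf(c).getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char rong = '\u9555'; // the character from Premier Zhu's name
        System.out.println("GB2312: " + canEncode("GB2312", rong));
        System.out.println("GBK:    " + canEncode("GBK", rong));
        System.out.println("GBK bytes: " + hexBytes("GBK", rong));
    }
}
```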
3.3 Encoding problems when reading and writing databases
Another place where encoding problems often occur in JSP/Servlet programming is reading and writing data in a database.
Popular relational database systems support a database encoding: a character set can be specified when the database is created, and the data is stored in that encoding. When an application accesses the data, an encoding conversion takes place on the way in and on the way out. For Chinese data, the database encoding should be chosen so that the integrity of the data is preserved. GB2312, GBK, UTF-8, and so on are all possible database encodings. ISO8859-1 (an 8-bit SBCS) can also be chosen, but then the application must split each 16-bit Chinese character or Unicode character into two 8-bit characters before writing, and after reading it must recombine the two bytes and also tell the SBCS characters apart. This fails to take advantage of the database encoding and adds programming complexity, so ISO8859-1 is not a recommended database encoding. In JSP/Servlet programming, first use the tools provided by the database management system to check that the Chinese data in the database is correct.
Then pay attention to the encoding of the data as it is read; inside a Java program it is generally Unicode. When writing data, the reverse applies.
3.4 Common techniques for locating problems
Locating Chinese encoding problems are usually the stupidest and most effective way to print a word after you think a suspect program has been processed.
The inner code of the character String. By printing the inner code of the string, you can find out when the Chinese characters are converted to unicode, when
Unicode is returned to the Chinese code, when a Chinese text into two Unicode characters, when the string was converted into a
A series of question marks, when the Chinese string of high-level was truncated ...
Taking the appropriate sample string also helps to differentiate between types of problems. Such as: "aa Ah aa?aa" and other Chinese and english, GB, GBK characteristics
Characters are all strings. In general, no matter how the English characters are converted or processed, it will not be distorted (if encountered, you can try
Increase the length of consecutive letters in english).
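Printing the internal code of a string takes only a few lines. The sketch below dumps both the UTF-16 code units of a Java string and its bytes in a chosen encoding; InnerCodeDump and its method names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class InnerCodeDump {

    /** Hex of each 16-bit char in the string, for example "0061 4e2d". */
    static String dumpChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    /** Hex of each byte after encoding the string in the named charset. */
    static String dumpBytes(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "a\u4e2d"; // an English letter followed by a Chinese character
        System.out.println(dumpChars(sample));               // 0061 4e2d
        System.out.println(dumpBytes(sample, "GBK"));        // 61 d6 d0
        System.out.println(dumpBytes(sample, "ISO-8859-1")); // 61 3f
    }
}
```

A Chinese character that shows up as 3f after encoding has already been replaced by a question mark, and a single character that shows up as two code units such as 00d6 00d0 has been decoded byte by byte.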
4. Concluding remarks
In fact, Chinese encoding in JSP/Servlet is not as complex as it seems. Although there is no fixed recipe for locating and solving the problems, and operating environments differ, the principles behind them are the same. Knowledge of character sets is the basis for solving character problems. However, as Chinese character sets keep changing, these problems will persist for some time, and not only in Java programming.
5. Reference articles
1) Character Problem Review
2) Analysis and Solution of the Chinese Character Problem in Java Programming Technology
3) NLS Characters in WebSphere: SBCS/DBCS Display on Same Page
4) GB18030
5) Setting Language Encoding in Web Applications: WebSphere Application Server

Previously written, migrated to this

Original Link: http://user.qzone.qq.com/372806800/blog/1336199467
