There are many excellent articles and discussions online about DBCS character encoding in JSP/Servlet. This article sorts them out and, drawing on the solutions available in IBM WebSphere Application Server 3.5 (WAS), adds some notes, in the hope that it is not superfluous.
1. The origins of the problem
Each country (or region) prescribes character encoding sets for computer information interchange, such as ASCII in the United States, GB2312-80 in China, and JIS in Japan. They serve as the basis for information processing in that country or region and play an important role in unifying encoding. By length, character encoding sets fall into two broad categories: SBCS (single-byte character set) and DBCS (double-byte character set). To handle local character information, early software (especially operating systems) appeared in various localized versions (L10N), and the concepts of LANG and codepage were introduced to tell them apart. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining each localized version independently is costly. It therefore became necessary to extract what the localization work has in common and handle it uniformly, minimizing locale-specific processing. This is called internationalization (I18N). The various language settings were further normalized into Locale information, and the underlying character set became Unicode, which contains almost all glyphs.
Most software with internationalization features now bases its core character processing on Unicode. The local character encoding is determined at run time from the Locale/LANG/codepage setting, and local characters are processed accordingly. During processing, conversion is needed between Unicode and the local character set, or even between two different local character sets with Unicode as the intermediary. In a network environment this approach extends further: character data at both ends of the connection must be converted into acceptable content according to the character set settings.
Java represents characters internally in Unicode, following Unicode V2.0. A Java program performs character encoding conversion whenever it reads or writes files in the file system, writes HTML to a URL connection, or reads parameter values from a URL connection. This adds programming complexity and easily causes confusion, but it is consistent with the idea of internationalization.
In theory, these conversions driven by character set settings should not cause many problems. In practice, the actual operating environment of the application, the ongoing supplementation and refinement of Unicode and the local character sets, and nonstandard system or application implementations mean that transcoding problems keep troubling programmers and users.
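The conversion between Java's internal Unicode strings and a local character set can be sketched as follows. This is a minimal illustration, not part of the original article: the class and method names are made up, and it assumes the JVM ships the GBK charset (standard JDK builds do, minimal runtimes may not).

```java
import java.nio.charset.Charset;

final class CharsetRoundTrip {
    // Encode a Java (Unicode) string into bytes of the named charset, then decode it back.
    static String roundTrip(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, written as Unicode escapes
        // Each of these characters occupies two bytes in GBK.
        System.out.println(text.getBytes(Charset.forName("GBK")).length);
        // GBK can represent both characters, so the round trip is lossless.
        System.out.println(roundTrip(text, "GBK").equals(text));
        // ISO-8859-1 cannot represent them, so each one is replaced by '?'.
        System.out.println(roundTrip(text, "ISO-8859-1"));
    }
}
```

The last call shows where the familiar question marks come from: the information is lost at encoding time and cannot be recovered afterwards.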
2. The GB2312-80, GBK, and GB18030-2000 Chinese character sets
In fact, the fix for an encoding problem in a Java program is often very simple, but understanding the reasons behind it and locating the problem requires knowledge of the existing Chinese encodings and of encoding conversion.
GB2312-80 was drawn up in the early days of Chinese computer information technology. It contains the commonly used level-1 and level-2 Chinese characters and the symbols of 9 areas. It is supported by almost all Chinese systems and internationalized software, and it is the most basic Chinese character set. Its coding range is 0xA1-0xFE for the high byte and 0xA1-0xFE for the low byte; the Chinese characters start at 0xB0A1 and end at 0xF7FE.
GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20,902 Chinese characters, and its coding range is 0x8140-0xFEFE, excluding the code positions with a high byte of 0x80. All of its characters map one-to-one to Unicode 2.0, which means that Java actually supports the GBK character set. GBK is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it, apparently because the developers do not fully understand GBK. Note that GBK is not a national standard, only a specification. With the release of the national standard GB18030-2000, it will complete its historical mission in the near future.
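The difference in coverage between GB2312 and GBK can be checked from Java. The sketch below is illustrative only; the class and method names are hypothetical, and it assumes the JVM provides the "GB2312" and "GBK" charsets. It compares \u554a, the first Chinese character of GB2312 at 0xB0A1, with \u9555, a character that GBK encodes as 0xE946, outside the GB2312 range.

```java
import java.nio.charset.Charset;

final class GbkCoverage {
    // True if the named charset has a code for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    // Hex form of the bytes that the named charset produces for the text.
    static String hexBytes(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char inGb2312 = '\u554a'; // first Chinese character of GB2312
        char gbkOnly = '\u9555';  // present in GBK, absent from GB2312
        System.out.println(canEncode("GB2312", inGb2312) + " " + hexBytes("\u554a", "GB2312"));
        System.out.println(canEncode("GB2312", gbkOnly));
        System.out.println(canEncode("GBK", gbkOnly) + " " + hexBytes("\u9555", "GBK"));
    }
}
```

A page declared as GB2312 that contains the second character is therefore at risk wherever a strict GB2312 converter is involved.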
GB18030-2000 (GBK2K) further expands the set of Chinese characters on the basis of GBK and adds the characters of minority scripts such as Tibetan and Mongolian. GBK2K fundamentally solves the problem of insufficient code positions and missing glyphs. It has several features:
It does not fix all glyphs; it only defines the coding ranges, leaving room for later expansion.
The encoding is variable length. The two-byte part is compatible with GBK; the four-byte part holds the expanded glyphs and code positions, with the first byte in 0x81-0xFE, the second byte in 0x30-0x39, the third byte in 0x81-0xFE, and the fourth byte in 0x30-0x39.
Its adoption is staged: the first requirement is to implement all the glyphs that map completely to the Unicode 3.0 standard.
It is a national standard and is mandatory.
No operating system or software supports GBK2K yet; this is work for the present and the future.
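The four-byte layout described above can be expressed as a simple range check. This is a sketch with made-up names that encodes only the byte ranges quoted in this section; it does not validate whether a given code is actually assigned to a character.

```java
final class Gb18030Ranges {
    private static boolean inRange(int value, int low, int high) {
        return value >= low && value <= high;
    }

    // True if the four unsigned byte values fall inside the GB18030 four-byte ranges:
    // 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39.
    static boolean isFourByteCode(int b1, int b2, int b3, int b4) {
        return inRange(b1, 0x81, 0xFE)
            && inRange(b2, 0x30, 0x39)
            && inRange(b3, 0x81, 0xFE)
            && inRange(b4, 0x30, 0x39);
    }

    public static void main(String[] args) {
        System.out.println(isFourByteCode(0x81, 0x30, 0x81, 0x30)); // lowest four-byte code
        System.out.println(isFourByteCode(0x81, 0x40, 0x81, 0x30)); // second byte 0x40 belongs to a two-byte code
    }
}
```

Because the second byte of a four-byte code is always an ASCII digit (0x30-0x39), a decoder can tell two-byte and four-byte codes apart after reading two bytes.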
3. JSP/Servlet encoding problems and their solutions in WAS
3.1 Common phenomena of encoding problems
The JSP/Servlet encoding problems that often come up online generally show themselves in the browser or on the application side, for example:
Why do the Chinese characters in a JSP/Servlet page viewed in the browser all become '?'?
Why do the Chinese characters in a Servlet page viewed in the browser become garbled?
Why do the Chinese characters in a Java application's interface become squares?
A JSP/Servlet page cannot display GBK characters.
A JSP/Servlet cannot receive the Chinese characters submitted by a form.
A JSP/Servlet does not get the correct content when reading from or writing to a database.
Behind these problems lie various wrong character conversions and wrong processing (except for the third, which is caused by a Java font setting error). To solve such character encoding problems, you need to understand how a JSP/Servlet runs and examine the points where problems may occur.
3.2 Encoding problems in JSP/Servlet web programming
A JSP/Servlet running on a Java application server provides HTML content to the browser, as shown in the following illustration:
Character encoding conversions take place at the following points:
a. JSP compilation. The Java application server reads the JSP source file according to the JVM's file.encoding value, converts it to the internal character encoding for JSP compilation, generates the Java source file, and writes it back to the file system according to file.encoding. If the current system language supports GBK, no encoding problem arises at this step. On an English-language system, such as Linux, AIX, or Solaris with LANG set to en_US, set the JVM's file.encoding value to GBK. If the system language is GB2312, decide as needed whether to set file.encoding; setting it to GBK avoids potential garbling of GBK characters.
b. The Java source must be compiled into a .class file to run in the JVM. This step has the same file.encoding problem as a. From here on, servlets and JSPs run the same way, except that servlet compilation is not automatic.
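Why file.encoding matters in steps a and b can be seen by decoding the same source bytes with two different charsets. The following is only a sketch: the class name, the method name, and the sample page fragment are invented for illustration, and a byte array stands in for the JSP file on disk.

```java
import java.nio.charset.Charset;

final class SourceEncodingDemo {
    // Decode raw source-file bytes with an explicitly named charset,
    // instead of relying on the JVM default taken from file.encoding.
    static String decodeSource(byte[] fileBytes, String charsetName) {
        return new String(fileBytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        String page = "<p>\u4e2d\u6587</p>"; // hypothetical JSP fragment with two Chinese characters
        byte[] onDisk = page.getBytes(Charset.forName("GBK")); // the file as saved by a GBK editor

        // Read with the matching charset: the text survives intact.
        System.out.println(decodeSource(onDisk, "GBK").equals(page));
        // Read as a single-byte charset: every byte becomes one char, so the text grows.
        System.out.println(decodeSource(onDisk, "ISO-8859-1").length() + " vs " + page.length());
    }
}
```

A JVM started on an en_US system behaves like the second call unless file.encoding is overridden.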
c. The servlet must convert the HTML page content into an encoding the browser accepts before sending it. Depending on the implementation, some Java application servers query the browser's Accept-Charset and Accept-Language headers or guess the encoding in some other way, while others ignore it entirely. Setting a fixed encoding is therefore probably the best solution. For Chinese web pages, set contentType="text/html; charset=GB2312" in the JSP or servlet; if the page contains GBK characters, set contentType="text/html; charset=GBK". Because IE and Netscape differ in how well they support GBK, this setting needs to be tested.
Because a 16-bit Java char can lose its high-order byte when it is sent over the network, and to make sure that the Chinese characters in a servlet page (both embedded ones and those produced while the servlet runs) arrive in the expected encoding, use PrintWriter out = res.getWriter() in place of ServletOutputStream out = res.getOutputStream(). The PrintWriter converts according to the charset specified in the contentType (the contentType must be set before the writer is obtained). Alternatively, wrap the ServletOutputStream in an OutputStreamWriter and output Chinese strings with write(String).
For JSP, the Java application server should be able to ensure that embedded Chinese characters are delivered correctly at this stage.
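The OutputStreamWriter approach from case c can be sketched without a servlet container. In the example below a ByteArrayOutputStream stands in for the ServletOutputStream, and the class and method names are hypothetical; in a real servlet the charset would be the one declared in the contentType.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class CharsetAwareOutput {
    // Write the page through a Writer bound to the response charset and
    // return the bytes that would travel over the network.
    static byte[] render(String html, String charsetName) {
        ByteArrayOutputStream body = new ByteArrayOutputStream(); // stand-in for ServletOutputStream
        try (Writer out = new OutputStreamWriter(body, Charset.forName(charsetName))) {
            out.write(html);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) {
        byte[] wire = render("<p>\u4e2d\u6587</p>", "GBK");
        // 7 ASCII bytes for the tags plus 2 bytes for each of the 2 Chinese characters.
        System.out.println(wire.length);
    }
}
```

The point of the design is that the char-to-byte conversion happens in exactly one place, under a charset the application has chosen, rather than wherever a raw byte stream happens to truncate characters.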
d. This is the URL character encoding problem. If the parameter values sent back from the browser via GET/POST contain Chinese characters, the servlet will not get the correct values. In Sun's J2SDK, HttpUtils.parseName does not consider the browser's language setting at all when parsing parameters; it builds the value byte by byte. This is the encoding problem most discussed online. Because it is a design flaw, the only remedies are to re-decode the resulting string from its binary form or to hack the HttpUtils class. Reference articles 2 and 3 describe both approaches, but it is best to change the Chinese encodings GB2312 and CP1381 used there to GBK; otherwise problems will arise when GBK characters are encountered.
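Re-decoding a parameter that was parsed byte by byte can be sketched as follows. This is an illustration under stated assumptions, not the code from the reference articles: the class and method names are made up, and byte-by-byte parsing is simulated by decoding the raw bytes as ISO-8859-1, which maps every byte to exactly one char.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ParameterRecovery {
    // Simulate a parser that turns each raw byte into one char.
    static String parseByteByByte(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Recover the intended text: get the original bytes back, then decode them properly.
    static String recover(String misparsed, String charsetName) {
        byte[] raw = misparsed.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String submitted = "\u4e2d\u6587"; // what the user typed into the form
        byte[] onTheWire = submitted.getBytes(Charset.forName("GBK"));

        String misparsed = parseByteByByte(onTheWire);
        System.out.println(misparsed.length()); // one char per byte, so twice the original length
        System.out.println(recover(misparsed, "GBK").equals(submitted));
    }
}
```

The recovery works only because ISO-8859-1 preserves every byte value; if the container had already replaced bytes with '?', nothing could be restored.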
Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, for specifying the encoding the application expects before calling request.getParameter("param_name"). This will help solve the problem thoroughly.
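The effect of declaring the request encoding before reading parameters can be imitated with the standard library. In this sketch java.net.URLEncoder and java.net.URLDecoder stand in for the browser and for the container's parameter parsing; the class and method names are hypothetical, and in a real servlet the equivalent step would be calling request.setCharacterEncoding("GBK") before the first request.getParameter call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class ExplicitRequestEncoding {
    // What a browser sends for a form value when the page charset is charsetName.
    static String encodeLikeBrowser(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // What parameter parsing yields when it is told which encoding to expect.
    static String decodeParameter(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String typed = "\u4e2d\u6587";
        String raw = encodeLikeBrowser(typed, "GBK");
        System.out.println(raw); // percent-encoded GBK bytes
        System.out.println(decodeParameter(raw, "GBK").equals(typed));        // matching encoding
        System.out.println(decodeParameter(raw, "ISO-8859-1").equals(typed)); // wrong encoding
    }
}
```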
WebSphere Application Server extends the standard Servlet API 2.x to provide better multilingual support. In cases c and d above, WAS queries the browser's language setting; by default, en, zh-cn, and so on are mapped to the Java encoding CP1381 (note: CP1381 is only a codepage equivalent to GB2312 and has no GBK support). I think this is because WAS cannot confirm whether the operating system running the browser supports GB2312 or GBK, so it chooses the smaller set. However, real applications still need GBK characters on their pages; the most famous example is the "镕" in Premier Zhu's name (rong2, 0xE946, \u9555), so it is sometimes necessary to specify the encoding/charset as GBK. Changing the default encoding in WAS is of course not as troublesome as described above: for a and b, refer to article 5 and specify -Dfile.encoding=GBK in the application server's command-line arguments; for d, specify -Ddefault.client.encoding=GBK in the application server's command-line arguments. If -Ddefault.client.encoding=GBK is specified, the charset no longer needs to be specified in case c.
3.3 Encoding problems when reading from and writing to a database
Another place where encoding problems often occur in JSP/Servlet programming is reading and writing data in a database.
Popular relational database systems support a database encoding: when you create a database you can specify its character set, and the data is stored in that encoding. When an application accesses the data, encoding conversion takes place on the way in and on the way out. For Chinese data, the database encoding should be chosen so that the data stays intact. GB2312, GBK, UTF-8, and so on are all possible database encodings. If you select ISO8859-1 (an 8-bit SBCS), the application must split each 16-bit Chinese character or Unicode character into two 8-bit characters before writing the data, and after reading the data it must combine the two bytes again while also recognizing the SBCS characters. This fails to take advantage of the database encoding and increases programming complexity, so ISO8859-1 is not a recommended database encoding. When programming a JSP/Servlet, you can first use the tools provided by the database management system to check whether the Chinese data is stored correctly.
Next, pay attention to the encoding of the data that is read out; what a Java program gets is generally Unicode. The opposite applies when writing data.
3.4 Common techniques for locating problems
The crudest yet most effective way to locate a Chinese encoding problem is to print the internal codes of the string after each processing step you suspect. By printing the internal codes, you can find out when Chinese characters are converted to Unicode, when Unicode is converted back to Chinese characters, when one Chinese character turns into two Unicode characters, when a Chinese string turns into a string of question marks, and when the high-order bytes of a Chinese string are truncated.
Taking the appropriate sample string also helps to distinguish between the types of problems. such as: "AA ah Aa?aa" and other Chinese and English, GB, GBK character characters are all strings. In general, no matter how the English character is converted or processed, it does not distort (if you do, you can try to increase the length of consecutive letters).
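Printing the internal codes of a string can be done with a small helper like the one below. The class and method names are made up for illustration, and the sample mixes English letters with \u554a (in GB2312) and \u9555 (GBK only), written as Unicode escapes so that the source file's own encoding cannot interfere.

```java
import java.nio.charset.Charset;

final class InternalCodeDump {
    // Code value of every Java char in the string, e.g. "U+0061 U+554A".
    static String unicodeCodes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Hex value of every byte the string becomes in the named charset.
    static String byteCodes(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "aa\u554aaa\u9555aa";
        System.out.println(unicodeCodes(sample));
        System.out.println(byteCodes(sample, "GBK"));
    }
}
```

Calling unicodeCodes at each suspected step shows at a glance whether a character is still a single code such as U+554A, has been split into two codes below U+0100, or has already collapsed to U+003F.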
4. Concluding remarks
In fact, Chinese encoding in JSP/Servlet is not as complex as it may seem. Although there is no fixed recipe for locating and solving problems, and operating environments vary, the underlying principle is the same. Knowledge of character sets is the basis for solving character problems. However, as the Chinese character sets continue to change, problems in Chinese information processing, not only in Java programming, will persist for some time.