JSP | servlet | encoding | Chinese characters | problems
There are many excellent articles and discussions on the Internet about DBCS character encoding in JSP/servlet. This article collates some of them and adds explanations based on the IBM WebSphere Application Server 3.5 (WAS) solution; I hope it is not redundant.
Contents:
The origin of the problem
GB2312-80, GBK, GB18030-2000 character sets and encodings
The origin of '?' and garbled characters in Chinese transcoding
JSP/servlet encoding problems and their solutions in WAS
Concluding remarks
Reference articles
1. The origin of the problem
Each country (or region) defines character encoding sets for computer information interchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. These sets are the basis of information processing in that country or region and play the important role of unifying encoding. By length, character encoding sets fall into two broad categories: SBCS (single-byte character sets) and DBCS (double-byte character sets). Early software (especially operating systems) handled local character information by shipping various localized versions (L10N), and concepts such as LANG and codepage were introduced to tell them apart. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining each localized version independently is costly. It therefore became necessary to extract what the localization work has in common and handle it uniformly, keeping locale-specific processing to a minimum. This is internationalization (I18N). The various kinds of language information are further normalized into locale information, and the underlying character set being processed becomes Unicode, which contains almost all glyphs.
Today the core character handling of most internationalized software is based on Unicode. When the software runs, it determines the local character encoding from the Locale/LANG/codepage setting and processes local characters accordingly. During processing it must convert between Unicode and the local character set, or even between two different local character sets with Unicode as the intermediate. In a network environment this approach extends further: the character information at each end of the connection has to be converted into acceptable content according to the character set settings.
Internally, the Java language represents characters in Unicode, following Unicode V2.0. A Java program performs character encoding conversion whether it reads or writes files in the file system, writes HTML to a URL connection, or reads parameter values from a URL connection. This adds programming complexity and easily causes confusion, but it is consistent with the idea of internationalization.
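As a minimal sketch of this conversion at the I/O boundary, the example below decodes the same two bytes through a Reader under two different encodings. The class name is hypothetical, an in-memory stream stands in for a file, and the Charset-based constructors are assumed to be available (they come from JDKs later than the one this article targets).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ReadWithEncodingDemo {
    // Decode a byte sequence through a Reader that uses the named encoding.
    static String read(byte[] data, String encoding) {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(data), Charset.forName(encoding))) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xb0, (byte) 0xa1}; // one GB2312/GBK character
        // Read as GBK: the two bytes form a single Java char.
        System.out.println(read(data, "GBK").length());
        // Read as ISO-8859-1: each byte becomes its own Java char.
        System.out.println(read(data, "ISO-8859-1").length());
    }
}
```

The point of the sketch is only that the encoding named at the boundary, not the bytes themselves, decides which Unicode characters the program sees.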
In theory, these conversions driven by character set settings should not cause many problems. In practice, because of the application's actual runtime environment, the ongoing additions and revisions to Unicode and the local character sets, and nonstandard system or application implementations, transcoding problems keep troubling programmers and users.
2. GB2312-80, GBK, GB18030-2000 character sets and encodings
In fact, the fix for an encoding problem in a Java program is often very simple, but understanding the reason behind it and locating the problem requires knowledge of the existing Chinese character encodings and of encoding conversion.
GB2312-80 was defined in the early stage of Chinese computer information technology. It contains the most commonly used level-1 and level-2 Chinese characters and nine zones of symbols. It is supported by almost all Chinese systems and internationalized software, and is the most basic Chinese character set. Its code range is 0xa1-0xfe for the high byte and 0xa1-0xfe for the low byte; Chinese characters start at 0xb0a1 and end at 0xf7fe.
GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20,902 Chinese characters, and its code range is 0x8140-0xfefe, excluding the code positions with high byte 0x80. All of its characters map one-to-one to Unicode 2.0, which means that Java actually provides support for the GBK character set. It is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it, apparently because the details of GBK are not yet fully understood. It is worth noting that GBK is not a national standard, only a specification. With the release of the national standard GB18030-2000, it will complete its historical mission in the near future.
GB18030-2000 (GBK2K) further extends the Chinese characters on the basis of GBK and adds the characters of minority scripts such as Tibetan and Mongolian. GBK2K fundamentally solves the problems of insufficient code positions and missing glyphs. It has several features:
It does not define every glyph; it only defines the code ranges, leaving room for later expansion.
The encoding is variable length: the two-byte part is compatible with GBK, and the four-byte part holds the extended glyphs and code positions, with the first byte in 0x81-0xfe, the second in 0x30-0x39, the third in 0x81-0xfe, and the fourth in 0x30-0x39.
Its adoption is phased; the first requirement is to implement all glyphs that map completely to the Unicode 3.0 standard.
It is a national standard and is mandatory.
No operating system or software supports GBK2K yet; that is the work for the present and the near future.
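The byte ranges listed above are enough to tell how long a GB18030 sequence is from its first two bytes. The following is a rough sketch, not a conforming decoder. The class and method names are hypothetical, and the second-byte range used for the two-byte part (0x40-0xfe, excluding 0x7f) is an assumption taken from the usual GBK layout rather than from the text above.

```java
class Gb18030Length {
    /**
     * Returns the length in bytes (1, 2 or 4) of the sequence starting at
     * offset i, or -1 if the bytes there do not fit the ranges given above.
     */
    static int sequenceLength(byte[] b, int i) {
        int b1 = b[i] & 0xff;
        if (b1 <= 0x7f) {
            return 1; // single byte, same as ASCII
        }
        if (b1 < 0x81 || b1 > 0xfe || i + 1 >= b.length) {
            return -1;
        }
        int b2 = b[i + 1] & 0xff;
        if (b2 >= 0x40 && b2 <= 0xfe && b2 != 0x7f) {
            return 2; // two-byte part, compatible with GBK (assumed range)
        }
        if (b2 >= 0x30 && b2 <= 0x39 && i + 3 < b.length) {
            int b3 = b[i + 2] & 0xff;
            int b4 = b[i + 3] & 0xff;
            if (b3 >= 0x81 && b3 <= 0xfe && b4 >= 0x30 && b4 <= 0x39) {
                return 4; // four-byte extension part
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] twoByte = {(byte) 0x81, (byte) 0x40};
        byte[] fourByte = {(byte) 0x81, (byte) 0x30, (byte) 0x81, (byte) 0x30};
        System.out.println(sequenceLength(twoByte, 0));
        System.out.println(sequenceLength(fourByte, 0));
    }
}
```

Because the second byte of a four-byte sequence falls in the digit range 0x30-0x39, which GBK never uses as a second byte, the two forms cannot be confused.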
An introduction to Unicode ... is omitted here.
The encodings supported by Java that are relevant to Chinese programming are as follows (several are not listed in the JDK documentation):
ASCII: 7-bit; same as ascii7
ISO8859-1: 8-bit; same as 8859_1, ISO-8859-1, ISO_8859-1, latin1 ...
GB2312-80: same as gb2312, GB2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB ...
GBK (note the case): same as MS936
UTF8: UTF-8
GB18030 (currently supported only by IBM JDK1.3.?): same as Cp1392, 1392
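Which of these names a given JVM accepts can be checked at run time. The sketch below assumes the java.nio.charset API of later JDKs, which is not what this article was written against, and the class name is hypothetical. The canonical names it reports follow java.nio conventions and may differ from the names in the list above.

```java
import java.nio.charset.Charset;

class EncodingNameCheck {
    // Returns the canonical charset name for an alias, or null if this JVM lacks it.
    static String canonicalName(String alias) {
        return Charset.isSupported(alias) ? Charset.forName(alias).name() : null;
    }

    public static void main(String[] args) {
        String[] names = {"ASCII", "latin1", "8859_1", "GB2312", "GBK", "UTF8", "GB18030"};
        for (String name : names) {
            System.out.println(name + " -> " + canonicalName(name));
        }
    }
}
```

A null in the output means the alias is unknown to that JVM, which is worth ruling out before looking for conversion bugs elsewhere.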
The Java language processes characters in Unicode. From another angle, though, a Java program can also work with non-Unicode encodings, as long as the Chinese character information is not distorted at the program's entry and exit points. For example, handling Chinese characters entirely in ISO-8859-1 can also give correct results. Many of the solutions circulating on the web fall into this category. To avoid confusion, this article does not discuss that method.
3. The origin of '?' and garbled characters in Chinese transcoding
Conversions in either of two directions may produce wrong results:
Unicode-->Byte: if the target code set has no corresponding code, the result is 0x3f.
For example:
The result of "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK") is "?ìéF?ù", and its hex value is 3fa8aca8a6463fa8b4.
Look closely at the result above and you will find that \u00ec is converted to 0xa8ac and \u00e9 to 0xa8a6 ... the number of significant bits has actually grown! This is because some symbols in the GB2312 symbol zones are mapped to common symbol codes: these symbols also appear in ISO-8859-1 or other SBCS character sets, so they sit early in Unicode, some with only 8 significant bits, overlapping the encoding of Chinese characters. (In fact this is only a mapping of codes; displayed carefully, they are not the same: the symbol in Unicode is single-byte wide, while the symbol in the Chinese character set is double-byte wide.) There are 20 such symbols between Unicode \u00a0 and \u00ff. Understanding this is very important! It explains why encoding errors in Java programs often produce some garbled characters (actually symbol characters) rather than all '?' characters, as in the example above.
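The Unicode-->Byte case above can be reproduced with a few lines. This is a minimal sketch: the class and helper names are hypothetical, and it uses the Charset overload of getBytes from later JDKs, which substitutes the encoding's replacement byte for unmappable characters in the same way.

```java
import java.nio.charset.Charset;

class UnicodeToGbkDemo {
    // Encode a string in the named encoding and return the bytes as lowercase hex.
    static String encodeToHex(String s, String encoding) {
        byte[] bytes = s.getBytes(Charset.forName(encoding));
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Six Latin-1 range characters, as in the example above.
        String s = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9";
        System.out.println(encodeToHex(s, "GBK"));
    }
}
```

On a JDK that ships the GBK charset this should print the hex value quoted above: two characters become 0x3f, one stays a single byte, and three grow to two bytes each.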
Byte-->Unicode: if the character identified by the bytes does not exist in the source code set, the result is 0xfffd.
For example:
byte ba[] = {(byte)0x81, (byte)0x40, (byte)0xb0, (byte)0xa1}; new String(ba, "gb2312");
The result is "?啊", and its hex value is "\ufffd\u554a". 0x8140 is a GBK character; the GB2312 conversion table has no corresponding value, so \ufffd is used. (Note: when this Unicode character is displayed, there is no corresponding local character, so the previous case applies and it is also shown as a "?".)
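The Byte-->Unicode case can be tried the same way. In this sketch the class name is hypothetical and the Charset overload of the String constructor is assumed. How many replacement characters appear for the unmappable bytes may vary between JDK versions, so only their presence is relied on here.

```java
import java.nio.charset.Charset;

class BytesToUnicodeDemo {
    // Decode bytes in the named encoding; bad input becomes the replacement char.
    static String decode(byte[] bytes, String encoding) {
        return new String(bytes, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        byte[] ba = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1};
        for (String encoding : new String[] {"GB2312", "GBK"}) {
            StringBuilder codes = new StringBuilder();
            for (char c : decode(ba, encoding).toCharArray()) {
                codes.append(Integer.toHexString(c)).append(' ');
            }
            // With GB2312 the first character cannot be decoded; with GBK it can.
            System.out.println(encoding + ": " + codes.toString().trim());
        }
    }
}
```

Decoding the same bytes as GBK gives two ordinary characters, which shows that the data was intact and only the choice of conversion table was wrong.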
In real programming, the wrong Chinese character information that a JSP/servlet program ends up with is often the result of these two processes stacked together, sometimes even stacked and then repeated.
4. JSP/servlet encoding problems and solutions in WAS
4.1 Common symptoms of encoding problems
JSP/servlet encoding problems frequently seen on the web generally show up in the browser or on the application side, for example:
Why do the Chinese characters in a JSP/servlet page appear as '?' in the browser?
Why do the Chinese characters in a servlet page appear garbled in the browser?
Why do the Chinese characters in a Java application's interface appear as squares?
A JSP/servlet page cannot display GBK characters.
The Chinese characters in Java code embedded in the <%...%> and <%=...%> tags of a JSP page are garbled, while the other Chinese characters on the page are correct.
A JSP/servlet cannot receive Chinese characters submitted by a form.
JSP/servlet database reads and writes do not return the correct content.
Behind these problems lie various wrong character conversions and processing steps (except the third, which is caused by an incorrect Java font setting). To solve such encoding problems, you need to understand how a JSP/servlet runs and check each point where a problem may occur.
4.2 Encoding problems in JSP/servlet web programming
JSP/servlet programs run on the Java application server and provide HTML content to the browser, as shown in the following illustration:
Character encoding conversions take place at these points:
a. JSP compilation. The Java application server reads the JSP source file according to the JVM's file.encoding value, generates the Java source file, and writes it back to the file system, again according to file.encoding. If the current system language supports GBK, there is no encoding problem at this point. On an English-language system, such as Linux, AIX, or Solaris with LANG set to en_US, set the JVM's file.encoding value to GBK. If the system language is GB2312, decide as needed whether to set file.encoding; setting it to GBK resolves potential garbling of GBK characters.
b. The Java source must be compiled to .class before it can run in the JVM, and this step has the same file.encoding problem as a. From here on, servlets and JSPs run the same way, except that servlet compilation is not automatic. For a JSP, the generated intermediate Java file is compiled automatically (the program calls the sun.tools.javac.Main class directly). So if a problem appears at this step, check the encoding and the OS locale, or convert the static Chinese characters embedded in the JSP's Java code to Unicode, or output the static text outside the Java code. For a servlet, simply specify the -encoding parameter by hand when compiling with javac.
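One reading of "convert the static Chinese characters to Unicode" is to write them as \uXXXX escapes, so that the source file contains only ASCII and compiles the same under any file.encoding. The helper below is a hypothetical sketch of that conversion, similar in spirit to what the JDK's native2ascii tool did; it is not taken from the original article.

```java
class UnicodeEscapeDemo {
    // Replace every non-ASCII char with its six-character Unicode escape.
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c); // plain ASCII is kept as-is
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The argument is itself written with an escape, so this file is pure ASCII.
        System.out.println(toUnicodeEscapes("A\u554a"));
    }
}
```

The escaped form can be pasted into a string literal inside <%...%>; the compiler turns it back into the same character regardless of the platform encoding.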
c. The servlet must convert the HTML page content into an encoding the browser accepts before sending it. Depending on how each Java application server is implemented, some query the browser's Accept-Charset and Accept-Language parameters or guess the encoding in other ways, while others do nothing. A fixed encoding may therefore be the best solution. For Chinese pages, set contentType="text/html; charset=GB2312" in the JSP or servlet; if the page contains GBK characters, set contentType="text/html; charset=GBK". Because IE and Netscape differ in how well they support GBK, this setting needs to be tested.
Because the high byte of a 16-bit Java char can be lost when it is sent over the network, and to make sure that the Chinese characters in a servlet page (both those embedded in it and those obtained while the servlet runs) have the expected internal code, you can use PrintWriter out = res.getWriter() in place of ServletOutputStream out = res.getOutputStream(). PrintWriter converts according to the charset specified in the content type (the content type must be set before this!). You can also wrap the ServletOutputStream in an OutputStreamWriter and output Chinese strings with write(String).
For JSP, the Java application server should be able to ensure that the embedded Chinese characters are sent out correctly at this stage.
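The OutputStreamWriter approach mentioned above can be sketched without a servlet container. In the example below a ByteArrayOutputStream stands in for the stream returned by res.getOutputStream(), and the class and method names are hypothetical. In a real servlet the content type with the matching charset would still have to be set on the response.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class FixedEncodingOutputDemo {
    // Write text through a Writer bound to a fixed encoding; return the bytes sent.
    static byte[] render(String text, String encoding) {
        // Stand-in for the servlet's output stream (assumption for illustration).
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, Charset.forName(encoding))) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sent = render("a\u554a", "GBK");
        // One byte for the letter plus two bytes for the Chinese character.
        System.out.println(sent.length);
    }
}
```

Binding the writer to an explicit encoding keeps the output independent of the server's default, which is the reason the text recommends a fixed encoding.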
d. This point concerns the URL character encoding problem. If the parameter values returned from the browser by GET/POST contain Chinese characters, the servlet will not get the correct values. In Sun's J2SDK, HttpUtils.parseName does not consider the browser's language setting at all when parsing parameters; it simply parses the received value byte by byte. This is the encoding problem most discussed on the Internet. Because it is a design flaw, the only options are to re-parse the resulting string in binary mode or to hack the HttpUtils class. Reference article 2 describes these approaches, but it is best to change the Chinese encodings GB2312 and CP1381 used there to GBK; otherwise GBK characters will still cause problems.
Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, which specifies the encoding the application expects before request.getParameter("param_name") is called. This will help solve the problem thoroughly.
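Both remedies can be imitated outside a container. In this sketch java.net.URLDecoder stands in for the container's parameter parser (an assumption for illustration), and the class and method names are hypothetical. Decoding with ISO-8859-1 mimics byte-wise parsing, repair re-parses the result "in binary mode", and decoding directly with GBK corresponds to telling the container the encoding up front.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;

class ParameterDecodingDemo {
    // Decode a percent-encoded parameter value using the named encoding.
    static String decodeWith(String raw, String encoding) {
        try {
            return URLDecoder.decode(raw, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Byte-wise parsing: every byte becomes one char, as ISO-8859-1 does.
    static String parseByteWise(String raw) {
        return decodeWith(raw, "ISO-8859-1");
    }

    // Turn the chars back into the original bytes, then decode them properly.
    static String repair(String garbled, String encoding) {
        byte[] original = garbled.getBytes(Charset.forName("ISO-8859-1"));
        return new String(original, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        String raw = "%D6%D0%CE%C4"; // two Chinese characters sent as GBK bytes
        String garbled = parseByteWise(raw);
        System.out.println(garbled.length());                  // four Latin-1 chars
        System.out.println(repair(garbled, "GBK").length());   // two chars again
        System.out.println(decodeWith(raw, "GBK").length());   // two chars directly
    }
}
```

The repair step works only because ISO-8859-1 maps every byte to exactly one char and back; with any other intermediate encoding the original bytes may already be lost.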
4.3 Solutions in IBM WebSphere Application Server
WebSphere Application Server extends the standard Servlet API 2.x to provide better multilanguage support. Running on a Chinese operating system, it handles Chinese characters well without any settings. The following instructions apply only when WAS runs on an English-language system or when GBK support is required.
In cases c and d above, WAS queries the browser's language setting. By default, en, zh-cn, and so on are all mapped to the Java encoding CP1381 (note: CP1381 is a codepage equivalent only to GB2312, with no GBK support). I think this is because WAS cannot tell whether the operating system the browser runs on supports GB2312 or GBK, so it picks the smaller set. However, real applications still need GBK characters on their pages; the most famous is the "镕" in Premier Zhu's name (rong2, 0xe946, \u9555), so it is sometimes necessary to specify the encoding/charset as GBK. In WAS, changing the default encoding is not as troublesome as described above. For a and b, see reference article 5 and specify -Dfile.encoding=GBK in the application server's command-line arguments. For d, specify -Ddefault.client.encoding=GBK in the application server's command-line arguments. If -Ddefault.client.encoding=GBK is specified, the charset no longer needs to be specified in case c.
One of the problems listed above is that static text contained in the Java code inside the <%...%> and <%=...%> tags is not displayed correctly. The workaround in WAS is, besides setting the correct file.encoding, to also set -Duser.language=zh -Duser.region=CN in the same way. This is related to the Java locale settings.
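After changing the command-line arguments it helps to confirm what the JVM actually sees. The sketch below simply prints the properties named above. The class name is hypothetical, and running it inside the application server (for example from a test servlet) rather than standalone is left as an assumption.

```java
import java.util.Locale;

class EncodingSettingsDump {
    // Format one system property, marking properties that were never set.
    static String describe(String key) {
        return key + "=" + System.getProperty(key, "(not set)");
    }

    public static void main(String[] args) {
        String[] keys = {
            "file.encoding", "default.client.encoding", "user.language", "user.region"
        };
        for (String key : keys) {
            System.out.println(describe(key));
        }
        // The default locale reflects the user.language and user.region settings.
        System.out.println("locale=" + Locale.getDefault());
    }
}
```

If a value still shows the old setting, the -D option was probably not placed where the server's JVM picks it up.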
4.4 Encoding problems when reading and writing databases
Another place where encoding problems often occur in JSP/servlet programming is reading data from and writing data to a database.
Popular relational database systems support a database encoding: when you create a database you can specify its character set, and the data is stored in that encoding. When an application accesses the data, an encoding conversion takes place at both the entry and the exit. For Chinese data, the database character encoding should be chosen so that the integrity of the data is preserved. GB2312, GBK, and UTF-8 are all possible database encodings. You can also choose ISO8859-1 (8-bit), but then the application must split each 16-bit Chinese character or Unicode character into two 8-bit characters before writing the data, and after reading the data it must combine the two bytes again while also recognizing SBCS characters. This fails to take advantage of the database encoding and increases programming complexity, so ISO8859-1 is not a recommended database encoding. In JSP/servlet programming, you can first use the administration tools provided by the database management system to check whether the Chinese data stored is correct.
Next, pay attention to the encoding of the data as it is read: what the Java program receives is generally Unicode. The reverse applies when writing data.
4.5 Common techniques for locating problems
The most stupid and most effective way to locate a Chinese encoding problem is to use it. Print the inner code of the string after you think the program is processed. By printing the inner code of the string, you can find out when Chinese characters are converted to Unicode, when Unicode is turned back into Chinese, and when one of the two Unicode characters is converted into a string of question marks, When was the high order of Chinese strings truncated ...
Taking the appropriate sample string also helps to distinguish between the types of problems. such as: "AA ah Aa?aa" and other Chinese and English, GB, GBK character characters are all strings. In general, no matter how the English character is converted or processed, it does not distort (if you do, you can try to increase the length of consecutive letters).
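A helper for printing the internal code takes only a few lines. This is a minimal sketch with hypothetical names; it prints each Java char as four hex digits so that question marks (003f), replacement characters (fffd), and stray Latin-1 values stand out.

```java
class InnerCodeDump {
    // Render each char of the string as a four-digit hex code, space separated.
    static String innerCode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Call this after each suspect step and compare the dumps.
        System.out.println(innerCode("aa\u554aaa"));
    }
}
```

Comparing the dump before and after a step shows at once whether that step changed the number of characters or replaced any of them.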
5. Concluding remarks
In fact, Chinese encoding in JSP/servlet is not as complex as it seems. There is no fixed recipe for locating and solving the problems, and runtime environments differ, but the principle behind them is the same. Knowledge of character sets is the basis for solving character problems. However, as the Chinese character sets keep changing, problems in Chinese information processing, not only in Java programming, will continue to exist for some time.