Question about Chinese character encoding in JSP/Servlet
There are many excellent articles and discussions on the issue of DBCS character encoding in JSP/Servlet on the Internet. This article will organize them and combine them with IBM WebSphere Application Server 3.5 (WAS) in the hope that it is not redundant.
Content:
Origin
GB2312-80, GBK, GB18030-2000 Chinese character sets and encodings
Sources of '?' and garbled characters during Chinese transcoding
JSP/Servlet Chinese character encoding and solutions in WAS
Conclusion
References
1. Origin of the problem
Each country (or region) specifies a coded character set for computer information interchange, such as the extended ASCII code of the United States, GB2312-80 of China, and JIS of Japan. These sets serve as the basis for information processing in that country or region and play an important role in unifying encoding. By length, character sets are divided into SBCS (single-byte character sets) and DBCS (double-byte character sets). To handle local character information, early software (especially operating systems) was released in various localized versions (L10N), and concepts such as LANG and Codepage were introduced to distinguish them. However, the code ranges of local character sets overlap, which makes it difficult to exchange information between them, and maintaining each localized version separately is costly. It is therefore necessary to extract what the localization work has in common and handle it consistently, keeping locale-specific processing to a minimum. This is internationalization (I18N). Language information is further standardized as Locale information, and the underlying character set becomes Unicode, which contains almost all glyphs.
Currently, the core character processing of most internationalized software is Unicode-based. At run time, the software determines the local character encoding from the current Locale/Lang/Codepage settings and handles local characters accordingly. In the process, text must be converted between Unicode and the local character set, or between two local character sets with Unicode as the intermediate form. This approach extends to the network environment: character information at either end of a network connection must be converted into acceptable content according to the character set settings.
The Java language represents characters in Unicode and complies with Unicode V2.0. Whether a Java program reads and writes files as character streams from/to the file system, writes HTML to a URL connection, or reads parameter values from a URL connection, a character encoding conversion takes place. Although this increases programming complexity and can cause confusion, it is in line with the idea of internationalization.
In theory, character conversion based on character set settings should not cause many problems. In practice, the runtime environments of applications differ, Unicode and the local character sets keep being supplemented and revised, and implementations of systems and applications are not standardized, so transcoding problems continue to plague programmers and users.
2. GB2312-80, GBK, GB18030-2000 Chinese character sets and encodings
In fact, the fix for a Chinese character encoding problem in a Java program is often very simple. But to understand the reason behind it and to locate the problem, you also need to understand the existing Chinese character encodings and the conversions between them.
GB2312-80 was developed at the initial stage of Chinese character information technology in China. It contains most of the commonly used first-level and second-level Chinese characters, plus symbols in nine zones. This character set is supported by almost all Chinese systems and internationalized software, and it is the most basic Chinese character set. Its encoding range is high byte 0xa1-0xfe, low byte 0xa1-0xfe; Chinese characters start at 0xb0a1 and end at 0xf7fe.
GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20902 Chinese characters, and its encoding range is 0x8140-0xfefe, excluding code positions whose high byte is 0x80. All of its characters map one-to-one to Unicode 2.0, which means Java in effect supports the GBK character set. It is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it, perhaps because the vendors do not fully understand what GBK is. Note that GBK is not a national standard, only a specification. With the release of the national standard GB18030-2000, it will fulfill its historical mission in the near future.
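The ranges quoted above can be turned into a small range check. This is only a sketch: the class and method names are hypothetical, and the exclusion of trail byte 0x7f in GBK is an assumption taken from the commonly published GBK layout rather than from this article.

```java
// Sketch: classify a two-byte code by the GB2312-80 and GBK ranges.
// DbcsRange and its methods are illustrative names, not a library API.
class DbcsRange {
    /** True if (hi, lo) lies in the GB2312-80 range: both bytes 0xa1-0xfe. */
    static boolean inGb2312(int hi, int lo) {
        return hi >= 0xa1 && hi <= 0xfe && lo >= 0xa1 && lo <= 0xfe;
    }

    /** True if (hi, lo) lies in the GB2312-80 Chinese character area 0xb0a1-0xf7fe. */
    static boolean isGb2312Hanzi(int hi, int lo) {
        return hi >= 0xb0 && hi <= 0xf7 && lo >= 0xa1 && lo <= 0xfe;
    }

    /**
     * True if (hi, lo) lies in the GBK range 0x8140-0xfefe.
     * Assumption: the trail byte 0x7f is unused, as in the usual GBK layout.
     */
    static boolean inGbk(int hi, int lo) {
        return hi >= 0x81 && hi <= 0xfe && lo >= 0x40 && lo <= 0xfe && lo != 0x7f;
    }

    public static void main(String[] args) {
        // 0xb0a1 is the first GB2312 Chinese character; 0x8140 is GBK-only.
        System.out.println(inGb2312(0xb0, 0xa1) + " " + inGbk(0xb0, 0xa1));
        System.out.println(inGb2312(0x81, 0x40) + " " + inGbk(0x81, 0x40));
    }
}
```

A code that passes inGbk but fails inGb2312 is the kind of character that breaks when a page or parameter is handled as GB2312 instead of GBK.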
On the basis of GBK, GB18030-2000 (GBK2K) further extends the Chinese character repertoire and adds the scripts of ethnic minorities such as Tibetan and Mongolian. GBK2K fundamentally solves the problem of insufficient code positions and glyphs. It has several features:
It does not define all the glyphs; it only specifies the encoding range, to be filled in later.
The encoding is variable-length. The two-byte part is compatible with GBK. The four-byte part holds the extended glyphs and code positions, with the encoding range first byte 0x81-0xfe, second byte 0x30-0x39, third byte 0x81-0xfe, and fourth byte 0x30-0x39.
It is being promoted in stages. The first requirement is that it can fully map all glyphs of the Unicode 3.0 standard.
It is a national standard and mandatory.
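The variable-length rule described in this list can be sketched as a function that reports how many bytes the next character occupies. The names are hypothetical, and two details are assumptions not stated above: bytes 0x00-0x7f are treated as single-byte characters, and the two-byte trail byte is taken to be 0x40-0xfe excluding 0x7f, as in GBK.

```java
// Sketch: length of the GB18030 character starting at a given offset.
// Gb18030Length is an illustrative name, not part of any library.
class Gb18030Length {
    /** Returns 1, 2 or 4, or -1 if the bytes at off are truncated or out of range. */
    static int charLength(byte[] b, int off) {
        int b1 = b[off] & 0xff;
        if (b1 <= 0x7f) {
            return 1; // assumption: ASCII range is single-byte
        }
        if (b1 < 0x81 || b1 > 0xfe || off + 1 >= b.length) {
            return -1;
        }
        int b2 = b[off + 1] & 0xff;
        if (b2 >= 0x30 && b2 <= 0x39) {
            // Four-byte form: 0x81-0xfe, 0x30-0x39, 0x81-0xfe, 0x30-0x39.
            if (off + 3 >= b.length) {
                return -1;
            }
            int b3 = b[off + 2] & 0xff;
            int b4 = b[off + 3] & 0xff;
            boolean ok = b3 >= 0x81 && b3 <= 0xfe && b4 >= 0x30 && b4 <= 0x39;
            return ok ? 4 : -1;
        }
        // Two-byte form, compatible with GBK (assumed trail range).
        boolean gbkTrail = b2 >= 0x40 && b2 <= 0xfe && b2 != 0x7f;
        return gbkTrail ? 2 : -1;
    }

    public static void main(String[] args) {
        byte[] sample = {0x41, (byte) 0xb0, (byte) 0xa1, (byte) 0x81, 0x30, (byte) 0x81, 0x30};
        for (int i = 0; i < sample.length; ) {
            int n = charLength(sample, i);
            System.out.println("offset " + i + ": " + n + " byte(s)");
            if (n < 0) break;
            i += n;
        }
    }
}
```

Because the second byte of a four-byte sequence falls in the ASCII digit range, a decoder can tell the two forms apart after reading only two bytes.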
At present, no operating system or software supports GBK2K. Supporting it is a task for current and future localization work.
An introduction to Unicode is omitted here.
Encodings supported by Java that are relevant to Chinese programming (several of them are not listed in the JDK documentation):
ASCII: 7-bit, same as ascii7
ISO8859-1: 8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, latin1...
GB2312-80: same as GB2312, GB2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB ......
GBK (case sensitive): same as MS936
UTF8: same as UTF-8
GB18030 (supported only by IBM JDK 1.3.?): same as Cp1392 and 1392
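Which of these names a given JVM actually accepts varies by vendor and version, so it can help to probe at run time. The following is a minimal sketch using the standard java.nio.charset.Charset API; the class name CharsetProbe and the particular list of names probed in main are illustrative choices.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: ask the running JVM which encoding names it recognizes.
class CharsetProbe {
    /** Maps each requested name to true if the JVM supports it. */
    static Map<String, Boolean> probe(String... names) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : names) {
            boolean supported;
            try {
                supported = Charset.isSupported(name);
            } catch (IllegalArgumentException e) {
                supported = false; // syntactically illegal charset name
            }
            result.put(name, supported);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Boolean> found =
                probe("ASCII", "ISO8859-1", "GB2312", "EUC_CN", "Cp1381", "GBK", "UTF8", "GB18030");
        for (Map.Entry<String, Boolean> e : found.entrySet()) {
            if (e.getValue()) {
                Charset cs = Charset.forName(e.getKey());
                System.out.println(e.getKey() + " -> " + cs.name() + " aliases=" + cs.aliases());
            } else {
                System.out.println(e.getKey() + " -> not supported on this JVM");
            }
        }
    }
}
```

The printed canonical name and alias set show whether two names from the list above really resolve to the same converter on the JVM at hand.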
The Java language uses Unicode to process characters. From another perspective, however, a Java program can also carry characters through a non-Unicode transcoding; what matters is that the Chinese character information is not distorted at the program's entry and exit. For example, handling Chinese characters as ISO-8859-1 can also produce correct results, and many popular solutions on the Internet are of this type. To avoid confusion, this article does not discuss that method.
3. Sources of '?' and garbled characters during Chinese transcoding
Both directions may have incorrect results:
Unicode --> Byte. If the target code set does not contain the corresponding character, the result is 0x3f.
For example:
"\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK") yields "?ìéF?ù"; the hex value is 3fa8aca8a6463fa8b4.
Look closely at the result above: \u00ec is converted to 0xa8ac and \u00e9 to 0xa8a6, so the number of significant bits actually grows. This is because some symbols in the GB2312 symbol area are mapped to common symbol code points. Because these symbols also appear in ISO-8859-1 or other SBCS character sets, they sit early in the Unicode code space, and some have only 8 significant bits, which overlaps with the byte values used to encode Chinese characters. (This mapping is only a code mapping; on careful inspection the displayed forms differ: the Unicode symbols are single-byte width, while the symbols in the Chinese character set are double-byte width.) There are 20 such symbols between Unicode \u00a0 and \u00ff. Understanding this is very important! It explains why Chinese character encoding errors in Java programs often produce garbled (actually symbol) characters rather than only '?' characters, as in the example above.
Byte --> Unicode. If the character identified by the bytes does not exist in the source code set, the result is 0xfffd.
For example:
byte ba[] = {(byte)0x81, (byte)0x40, (byte)0xb0, (byte)0xa1}; new String(ba, "GB2312");
The result is "?啊", with hex value "\ufffd\u554a". 0x8140 is a GBK character; the GB2312 conversion table has no corresponding value, so \ufffd is used. (Note: when this Unicode value is displayed, no corresponding local character exists, so the previous case also applies and it is shown as "?".)
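Both directions can be reproduced with a few lines of Java. This is a sketch with hypothetical names (TranscodeDemo, encodeHex, decode, codeUnits); it assumes the JVM ships the GBK and GB2312 converters. The exact replacement output for the undecodable bytes may differ between JDK versions, so main prints the results rather than asserting them.

```java
import java.nio.charset.Charset;

// Sketch: the two lossy directions, Unicode to bytes and bytes to Unicode.
class TranscodeDemo {
    /** Encodes s and returns the bytes as lowercase hex; unmappable characters become the charset's replacement. */
    static String encodeHex(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Decodes bytes; sequences the charset cannot decode become U+FFFD. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Returns the UTF-16 code units of s as space-separated hex. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode to bytes: characters missing from GBK are replaced.
        System.out.println(encodeHex("\u00d6\u00ec\u00e9\u0046\u00bb\u00f9", "GBK"));

        // Bytes to Unicode: 0x8140 is valid in GBK but not in GB2312.
        byte[] ba = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1};
        System.out.println("GB2312: " + codeUnits(decode(ba, "GB2312")));
        System.out.println("GBK:    " + codeUnits(decode(ba, "GBK")));
    }
}
```

Comparing the GB2312 and GBK lines of the output shows the same four bytes surviving intact under one charset and losing a character under the other.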
In actual programming, the incorrect Chinese character information that JSP/Servlet programs obtain is often the superposition of these two processes, and sometimes the result of applying them repeatedly after they have been combined.
4. JSP/Servlet Chinese character encoding Problems and Solutions in WAS
4.1 Common encoding problems
Common JSP/Servlet encoding problems discussed on the Internet generally show up in the browser or in the application, such as:
Why are Chinese characters on the Jsp/Servlet page displayed in the browser '? '?
Why are Chinese characters on the Servlet page displayed in the browser garbled?
Why do Chinese characters in the Java application interface appear as square blocks?
The Jsp/Servlet page cannot display GBK Chinese characters.
Chinese characters in Java code embedded in the <%...%> and <%=...%> tags of a JSP page are garbled, while other Chinese characters on the page are correct.
Jsp/Servlet cannot receive Chinese characters submitted by form.
JSP/Servlet cannot read correct content from, or write correct content to, the database.
Behind all these problems lie errors in character conversion and processing (except the third, which is caused by incorrect Java font settings). To solve character encoding problems of this kind, you need to understand how a JSP/Servlet runs and check each point where a problem may arise.
4.2 Encoding in JSP/Servlet web programming
A JSP/Servlet running on the Java application server provides HTML content to the browser.
Character encoding conversions take place at the following points:
a. JSP compilation. The Java application server reads the JSP source file according to the JVM's file.encoding value, compiles it into a Java source file, and writes that file back to the file system, again according to file.encoding. If the current system language supports GBK, no encoding problem arises at this step. On an English system, for example Linux, AIX, or Solaris with LANG set to en_US, set the JVM's file.encoding value to GBK. If the system language is GB2312, decide whether to set file.encoding as needed; setting it to GBK avoids potential problems with garbled GBK characters.
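Before changing any setting, it is worth confirming what the JVM currently uses. The sketch below only reads standard values (the file.encoding system property and Charset.defaultCharset()); the class name is hypothetical, and how a particular application server passes JVM options is outside its scope.

```java
import java.nio.charset.Charset;

// Sketch: report the encoding the JVM uses when no charset is given explicitly.
class DefaultEncodingCheck {
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Run once as-is and once with a JVM option such as -Dfile.encoding=GBK
        // to see whether the option takes effect on the JVM in use.
        System.out.println(describe());
    }
}
```

If the reported default is not a charset that covers GBK, files read or written without an explicit encoding are candidates for the transcoding losses shown in section 3.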
b. Java compilation. The Java source must be compiled into a .class file before it can run in the JVM. This step has the same file.encoding issue as a. From here on, servlets and JSPs behave alike, except that servlet compilation is not automatic, whereas for JSPs the generated intermediate Java file is compiled automatically (the program calls the sun.tools.javac.Main class directly). So if a problem occurs at this step, check the encoding and the OS language environment, or convert the static Chinese characters embedded in the JSP's Java code to Unicode escapes, or avoid placing static text output in Java code at all. For servlets, you can specify the -encoding parameter manually when compiling with javac.
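Converting static Chinese text to Unicode escapes makes the source pure ASCII, so the compiler's encoding setting no longer matters for those literals. A minimal sketch of such a conversion follows; UnicodeEscaper is a hypothetical helper written for illustration, not a JDK tool.

```java
// Sketch: rewrite non-ASCII characters as Java Unicode escapes.
class UnicodeEscaper {
    /** Returns s with every char above 0x7f replaced by a backslash-u escape. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The argument is the text to paste into Java source as a literal.
        for (String arg : args) {
            System.out.println(escape(arg));
        }
    }
}
```

The escaped form compiles to the same string under any -encoding value, which is why it sidesteps step b entirely.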
c. Output to the browser. The servlet must convert the HTML page content into an encoding the browser accepts and send it out. Depending on how each Java application server is implemented, some query the browser's Accept-Charset and Accept-Language parameters or guess the encoding in other ways, and some do not. Using a fixed encoding is therefore probably the best solution. For Chinese web pages, you can set contentType="text/html; charset=GB2312"; if the page contains GBK characters, set contentType="text/html; charset=GBK". Because IE and Netscape differ in their level of GBK support, test this setting.
Because the high 8 bits of a 16-bit Java char are discarded when it is transmitted over the network, and to make sure the characters on the servlet page (both embedded ones and ones obtained while the servlet runs) are sent in the expected internal code, use PrintWriter out = res.getWriter() instead of ServletOutputStream out = res.getOutputStream(). The PrintWriter converts data according to the charset specified in contentType (contentType must be set before obtaining the writer!). You can also wrap the ServletOutputStream in an OutputStreamWriter and use write(String) to output Chinese character strings.
For JSP, the JAVA Application Server should be able to ensure that the embedded Chinese characters are correctly transmitted at this stage.
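The OutputStreamWriter approach mentioned under c can be tried outside a servlet container. In the sketch below a ByteArrayOutputStream stands in for the ServletOutputStream (an assumption made so the example is self-contained), and EncodedOutput is a hypothetical class name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Sketch: write text through a Writer that encodes with an explicit charset.
class EncodedOutput {
    /** Returns the bytes produced by writing text in the given charset. */
    static byte[] write(String text, String charsetName) {
        // Stand-in for the servlet's raw output stream.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(raw, Charset.forName(charsetName))) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.toByteArray();
    }

    public static void main(String[] args) {
        // One Chinese character written under two different charsets.
        System.out.println("UTF-8 bytes: " + write("\u554a", "UTF-8").length);
        System.out.println("GBK bytes:   " + write("\u554a", "GBK").length);
    }
}
```

The charset passed to the writer should be the same one announced in contentType; otherwise the browser decodes the bytes with the wrong table.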
d. Parameter (URL) character encoding. If a parameter value sent back by the browser through GET/POST contains Chinese characters, the servlet cannot obtain the correct value. In Sun's J2SDK, HttpUtils.parseName does not take the browser's language settings into account when parsing parameters; it parses the received value byte by byte. This is the encoding problem discussed most on the Internet. Because it is a design defect, you can only re-parse the resulting string as raw bytes, or fix it by hacking the HttpUtils class. For more information, see article 2. However, we recommend changing the Chinese encodings GB2312 and CP1381 used there to GBK; otherwise problems remain when GBK Chinese characters appear.
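Re-parsing the string as raw bytes can be sketched as follows. The example assumes the container turned each received byte into one char, which is equivalent to decoding as ISO-8859-1; that equivalence is an assumption about the container's behavior, and ParamRepair with its two methods is hypothetical. URL-decoding of %XX sequences is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: undo a byte-per-char mis-decoding of a request parameter.
class ParamRepair {
    /** Simulates the container: bytes sent by the browser, decoded one byte per char. */
    static String misdecode(String original, String browserCharset) {
        byte[] wire = original.getBytes(Charset.forName(browserCharset));
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the raw bytes and decodes them with the charset the browser really used. */
    static String repair(String garbled, String browserCharset) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(browserCharset));
    }

    public static void main(String[] args) {
        String sent = "\u554a\u4e02"; // one GB2312 character and one GBK-only character
        String received = misdecode(sent, "GBK");
        System.out.println(sent.equals(repair(received, "GBK")));
    }
}
```

The round trip through ISO-8859-1 is lossless because that charset maps every byte value to exactly one char; repairing with GB2312 instead of GBK would lose the second character, which is the reason for the recommendation above.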
Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, which specifies the encoding the application expects; call it before calling request.getParameter("param_name").
4.3 Solutions in IBM WebSphere Application Server
WebSphere Application Server extends the standard Servlet API 2.x to provide better multi-language support. When running on a Chinese operating system, it can process Chinese characters without any settings. The following description applies only when WAS runs on an English system or when GBK support is required.
In cases c and d above, WAS must query the browser's language settings. By default, zh and zh-cn are both mapped to the Java encoding CP1381 (note: CP1381 is a codepage equivalent to GB2312 and does not support GBK). I think this is because WAS cannot confirm whether the operating system running the browser supports GB2312 or GBK, so it takes the smaller set. However, real application systems still need GBK Chinese characters on the page; the most famous is "镕" (rong2, 0xe946, \u9555) in Premier Zhu's name. So you must specify Encoding/Charset as GBK. Of course, changing the default encoding in WAS is not as troublesome as described above. For a and b, refer to article 5: specify -Dfile.encoding=GBK in the command line parameters of the Application Server. For d, specify -Ddefault.client.encoding=GBK in the command line parameters of the Application Server. If -Ddefault.client.encoding=GBK is specified, charset no longer needs to be specified in c.
The list of problems above also includes one about static text in Java code inside <%...%> and <%=...%> tags not being displayed correctly. The solution in WAS is, in addition to setting the correct file.encoding, to set -Duser.language=zh -Duser.region=CN in the same way. This is related to the Java locale setting.
4.4 Encoding during database read/write
Another place where encoding problems often occur in JSP/Servlet programming is reading and writing data in the database.
Popular relational database systems support a database encoding; that is, when creating a database you can specify its character set, and data is stored in that encoding. When an application accesses the data, an encoding conversion takes place on the way in and on the way out. For Chinese data, the database's character encoding should be chosen to preserve data integrity. GB2312, GBK, UTF-8 and so on are all possible database encodings. You can also choose ISO8859-1 (8-bit), but then the application must split each 16-bit Chinese or Unicode character into two 8-bit characters before writing, and after reading it must recombine the two bytes and also distinguish SBCS characters. This fails to make use of the database encoding and increases programming complexity, so ISO8859-1 is not a recommended database encoding. During JSP/Servlet programming, you can first use the management tools provided by the database management system to check whether the Chinese data is correct.
Note that data read into a Java program is generally in Unicode; when writing data, the conversion runs in the opposite direction.
4.5 Frequently used troubleshooting techniques
The crudest but most effective way to locate a Chinese encoding problem is to print the string's internal code after each piece of processing you suspect. By printing the internal code you can find out when Chinese characters are converted to Unicode, when Unicode is converted back to Chinese characters, when one Chinese character turns into two Unicode characters, when a Chinese string turns into question marks, when the high byte of a Chinese string is truncated, and so on.
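Printing the internal code takes only a small helper. The sketch below uses a hypothetical class CodeDump that renders each UTF-16 code unit of a string as four hex digits, so the value can be logged safely regardless of the console's encoding.

```java
// Sketch: dump a string's internal (UTF-16) code units for logging.
class CodeDump {
    /** Returns the code units of s as space-separated four-digit hex values. */
    static String chars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly decoded character shows as one unit; a character that was
        // split into bytes shows as two units below 0100; a lost one shows as 003f or fffd.
        System.out.println(chars("a\u554a"));
        System.out.println(chars("a\u00b0\u00a1"));
        System.out.println(chars("a?"));
    }
}
```

Calling the helper after each suspect step and comparing consecutive dumps pinpoints the step where a value such as 554a turns into 00b0 00a1, 003f, or fffd.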
Selecting an appropriate sample string also helps to identify the type of problem, for example "aa, aa, aa" and similar strings that interleave English letters with characteristic GB and GBK characters. In general, English characters are not distorted no matter how they are converted or processed (if they are, try increasing the number of consecutive English letters).
5. Conclusion
In fact, Chinese encoding in JSP/Servlet is not as complicated as it seems. Although there is no fixed rule for locating and solving problems, and runtime environments differ, the underlying principles are the same. Understanding character sets is the basis for solving character problems. However, as Chinese character sets continue to change, Chinese information processing problems will persist for some time, and not only in Java programming.
6. References
Character Problem Review
Analysis and Solution of Chinese Character Problems in Java programming technology
GB18030
Setting language encoding in web applications: WebSphere Application Server