Question about Chinese character encoding in JSP/servlet (inber favorites)

Source: Internet
Author: User
Tags websphere application server


Question about Chinese character encoding in JSP/Servlet

From: http://www-900.ibm.com/developerWorks/cn/java/jsp_dbcsz/index.shtml

There are many excellent articles and discussions on the Internet about DBCS character encoding in JSP/servlet. This article organizes them and explains them in the context of IBM WebSphere Application Server 3.5 (WAS), in the hope that it is not redundant.

Content:
Origin of the problem
GB2312-80, GBK, GB18030-2000 Chinese character sets and encodings
Sources of '?' and garbled characters during Chinese transcoding
JSP/servlet Chinese character encoding problems and solutions in WAS
Conclusion
References

1. Origin of the problem

Each country (or region) defines a coded character set for computer information interchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. These serve as the basis for information processing in that country or region and play an important role in unifying encodings. By code length, character sets are divided into SBCS (single-byte character set) and DBCS (double-byte character set). To handle local character information, early software (especially operating systems) shipped in various localized versions (L10N), and concepts such as LANG and code page were introduced to tell them apart. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining each localized version of the software independently is costly. It is therefore necessary to extract what the localization work has in common and handle it uniformly, reducing the locale-specific processing to a minimum. This is internationalization (I18N). The language information is further standardized as locale information, and the underlying character set becomes Unicode, which contains almost all glyphs.

Most software with internationalization features now has a Unicode-based core for character processing. At run time, the software determines the local character encoding from the current locale/LANG/code page settings and handles local characters accordingly. During processing, Unicode and the local character set must be converted to each other, or two different local character sets must be converted to each other with Unicode as the intermediary. In a network environment this extends further: character data at either end of a connection must be converted to acceptable content according to the character set settings.

The Java language uses Unicode to represent characters and complies with Unicode V2.0. Whether a Java program reads or writes files in the file system, writes HTML to a URL connection, or reads parameter values from a URL connection, character encoding conversion takes place. Although this increases programming complexity and can cause confusion, it is in line with the idea of internationalization.
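The conversion between two local character sets through Unicode can be sketched in Java as follows. This is a minimal illustration rather than code from the original article: the class and method names (TranscodeSketch, transcode, toHex) are hypothetical, and the sketch assumes the running JVM provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodeSketch {
    /** Converts bytes from one encoding to another, using a Java String (Unicode) as the pivot. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String unicode = new String(input, from); // local encoding -> Unicode
        return unicode.getBytes(to);              // Unicode -> target encoding
    }

    /** Formats bytes as lowercase hex for inspection. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JVM
        byte[] gbkBytes = {(byte) 0xb0, (byte) 0xa1}; // the character U+554A encoded in GBK
        System.out.println(toHex(transcode(gbkBytes, gbk, StandardCharsets.UTF_8)));
    }
}
```

Passing explicit Charset objects, instead of relying on the platform default, keeps the result independent of the locale the JVM happens to run in.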

In theory, character conversion driven by character set settings should not cause many problems. In practice, applications run in differing environments, Unicode and the local character sets keep being supplemented and revised, and system or application implementations do not follow the standards strictly, so transcoding problems continue to plague programmers and users.

2. GB2312-80, GBK, GB18030-2000 Chinese Character Set and encoding

In fact, the fix for a Chinese encoding problem in a Java program is often very simple. To understand the reason behind it and to locate the problem, however, you also need to understand the existing Chinese character encodings and the conversions between them.

GB2312-80 was established in the early stage of Chinese character information technology in China. It contains most of the commonly used level-1 and level-2 Chinese characters, plus symbols in 9 zones. This character set is supported by almost all Chinese systems and internationalized software, and it is the most basic Chinese character set. Its encoding range is high byte 0xa1-0xfe, low byte 0xa1-0xfe; Chinese characters start at 0xb0a1 and end at 0xf7fe.

GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20902 Chinese characters, and its encoding range is 0x8140-0xfefe, excluding code positions whose high byte is 0x80. All of its characters map one-to-one to Unicode 2.0, which means that Java in effect supports the GBK character set. GBK is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it, apparently because its authors do not fully understand what GBK is. Note that GBK is not a national standard, only a specification. With the release of the GB18030-2000 national standard, it will have fulfilled its historical mission in the near future.

Building on GBK, GB18030-2000 (GBK2K) further extends the set of Chinese characters and adds the scripts of ethnic minorities such as Tibetan and Mongolian. GBK2K fundamentally solves the problem of insufficient code positions and missing glyphs. It has several features:

    • It does not define all glyphs; it only specifies the encoding ranges, leaving room for later extension.
    • The encoding is variable-length. The two-byte part is compatible with GBK; the four-byte part holds the extended glyphs and code positions, with the ranges first byte 0x81-0xfe, second byte 0x30-0x39, third byte 0x81-0xfe, and fourth byte 0x30-0x39.
    • It is being introduced in stages. The first requirement is full mapping to all glyphs of the Unicode 3.0 standard.
    • It is a national standard and is mandatory.
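The byte ranges quoted in this section can be turned into a small classifier. The sketch below is only an illustration under stated assumptions: the class Gb18030Layout and its methods are hypothetical names, the single-byte range 0x00-0x7f and the two-byte trail range 0x40-0xfe (without 0x7f) are assumptions taken from common GBK practice rather than from the text above, and the code checks ranges only, not whether a code position is actually assigned.

```java
final class Gb18030Layout {
    /** Returns 1, 2 or 4 for the sequence starting at offset, or -1 if it fits none of the ranges. */
    static int sequenceLength(byte[] data, int offset) {
        int b1 = data[offset] & 0xff;
        if (b1 <= 0x7f) {
            return 1; // assumption: single-byte ASCII part
        }
        if (b1 < 0x81 || b1 > 0xfe || offset + 1 >= data.length) {
            return -1;
        }
        int b2 = data[offset + 1] & 0xff;
        if (b2 >= 0x30 && b2 <= 0x39) {
            // Four-byte part: 0x81-0xfe, 0x30-0x39, 0x81-0xfe, 0x30-0x39.
            if (offset + 3 >= data.length) {
                return -1;
            }
            int b3 = data[offset + 2] & 0xff;
            int b4 = data[offset + 3] & 0xff;
            boolean valid = b3 >= 0x81 && b3 <= 0xfe && b4 >= 0x30 && b4 <= 0x39;
            return valid ? 4 : -1;
        }
        // Two-byte part, compatible with GBK (trail byte range is an assumption).
        return (b2 >= 0x40 && b2 <= 0xfe && b2 != 0x7f) ? 2 : -1;
    }

    /** True if a two-byte code lies in the GB2312-80 Chinese character area 0xb0a1-0xf7fe. */
    static boolean isGb2312Hanzi(int code) {
        int high = (code >> 8) & 0xff;
        int low = code & 0xff;
        return high >= 0xb0 && high <= 0xf7 && low >= 0xa1 && low <= 0xfe;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(new byte[]{(byte) 0x81, 0x30, (byte) 0x81, 0x30}, 0));
        System.out.println(isGb2312Hanzi(0xb0a1));
    }
}
```

Because the second byte of a four-byte sequence is a digit (0x30-0x39), which never occurs as a GBK trail byte, the two parts can be told apart after reading two bytes.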

At present, no operating system or software supports GBK2K yet. This is work for current and future localization.

Unicode introduction.

The encodings supported by Java that relate to Chinese programming are as follows (several are not listed in the JDK documentation):

ASCII: 7-bit, same as ascii7
ISO8859-1: 8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, latin1, ...
GB2312-80: same as gb2312, gb2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB, ...
GBK: (note: case-sensitive) same as MS936
UTF8: UTF-8
GB18030: (currently supported only by IBM JDK 1.3.?) same as Cp1392, 1392
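Which of these names a particular JVM actually accepts can be checked at run time. The following is a minimal sketch using the standard java.nio.charset.Charset API; the class name CharsetNames and the helper canonicalName are hypothetical, and the set of supported names and aliases depends on the JDK in use, so no particular output is implied.

```java
import java.nio.charset.Charset;

final class CharsetNames {
    /** Returns the canonical charset name for an encoding name, or null if this JVM lacks it. */
    static String canonicalName(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalArgumentException e) {
            return null; // the name is not even syntactically legal
        }
    }

    public static void main(String[] args) {
        String[] names = {"ASCII", "8859_1", "latin1", "GB2312", "EUC_CN",
                          "Cp1381", "GBK", "MS936", "UTF8", "GB18030"};
        for (String name : names) {
            String canonical = canonicalName(name);
            if (canonical == null) {
                System.out.println(name + " -> not supported by this JVM");
            } else {
                // Also list the aliases this JVM knows for the charset.
                System.out.println(name + " -> " + canonical + " "
                        + Charset.forName(name).aliases());
            }
        }
    }
}
```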

The Java language processes characters in Unicode. Seen from another angle, however, a Java program can also carry non-Unicode encoded data through its strings; what matters is that the Chinese character information is not distorted at the program's entry and exit points. For example, handling Chinese characters as ISO-8859-1 can also produce correct results, and many popular solutions on the Internet are of this type. To avoid confusion, this article does not discuss that approach.

3. Sources of '?' and garbled characters during Chinese transcoding

Either direction of conversion can produce wrong results:

    • Unicode --> byte. If the target code set does not contain the corresponding code, the result is 0x3f.

      For example:
      "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK") gives "?ìéF?ù"; in hex, 3fa8aca8a6463fa8b4.

      Look closely at this result and you will find that \u00ec is converted to 0xa8ac and \u00e9 to 0xa8a6: the number of significant bits has grown! This is because some symbols in the GB2312 symbol area are mapped to common symbol codes. Because these symbols appear in ISO-8859-1 or other SBCS character sets, they sit early in the Unicode code space; some have only 8 significant bits and overlap with the byte values used to encode Chinese characters. (In fact this mapping is only a mapping of codes; on close inspection the displayed forms differ: the Unicode symbols are single-byte width, while the symbols in the Chinese character set are double-byte width.) There are 20 such symbols between Unicode \u00a0 and \u00ff. Understanding this is very important! It explains why Chinese encoding errors in Java programs often show up as garbled characters (actually symbol characters) rather than only as '?', as in the example above.

    • byte --> Unicode. If the character identified by byte does not exist in the Source Code set, the result is 0xfffd.

      For example:
      byte ba[] = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1}; new String(ba, "gb2312");

      The result is "?啊", in hex "\ufffd\u554a". 0x8140 is a GBK character that has no entry in the GB2312 conversion table, so \ufffd is used instead. (Note: when this Unicode character is displayed, there is no corresponding local character, so the previous case applies and it is shown as "?".)
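Both directions can be reproduced with a short program. This is a hedged sketch, not the author's code: the class LossyConversionDemo and its helpers are hypothetical, and it assumes the JVM ships the GBK and GB2312 charsets. The exact characters a decoder emits for the invalid bytes 0x81 0x40 may differ between JDK versions (some may also emit an extra character for 0x40), so the program prints the code units instead of assuming them.

```java
import java.nio.charset.Charset;

final class LossyConversionDemo {
    /** Formats bytes as lowercase hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Formats each UTF-16 code unit of a string as four hex digits. */
    static String toCodeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");       // assumption: available in this JVM
        Charset gb2312 = Charset.forName("GB2312"); // assumption: available in this JVM

        // Unicode --> byte: characters missing from GBK are replaced by 0x3f.
        byte[] encoded = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes(gbk);
        System.out.println(toHex(encoded));

        // byte --> Unicode: bytes with no GB2312 mapping become U+FFFD.
        byte[] ba = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1};
        System.out.println(toCodeUnits(new String(ba, gb2312)));
    }
}
```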

In actual programming, the wrong Chinese text that JSP/servlet programs obtain is often the superposition of these two processes, and sometimes the result of repeated conversions after that superposition.

4. JSP/servlet Chinese character encoding problems and solutions in WAS

4.1 Common encoding problems
The JSP/servlet encoding problems commonly reported on the Internet generally show up in the browser or on the application side, for example:

    • Why are Chinese characters on a JSP/servlet page displayed as '?' in the browser?
    • Why are Chinese characters on a servlet page garbled in the browser?
    • Why do Chinese characters in a Java application's user interface appear as square blocks?
    • The JSP/servlet page cannot display GBK Chinese characters.
    • Chinese characters in the Java code contained in <%...%> and <%=...%> tags of a JSP page are garbled, while other Chinese characters on the page are correct.
    • The JSP/servlet cannot receive Chinese characters submitted by a form.
    • The JSP/servlet cannot read or write correct content from or to the database.

Behind these problems lie various errors in character conversion and processing (except the third, which is caused by incorrect Java font settings). To solve such encoding problems, you need to understand how a JSP/servlet runs and check each point where problems can arise.

4.2 Encoding in JSP/servlet web programming
A JSP/servlet running on a Java application server provides HTML content to the browser. The character encoding conversions involved at each step are as follows:

  a. JSP compilation. The Java application server reads the JSP source file according to the JVM file.encoding value, generates the Java source file, and writes it back to the file system using the same file.encoding value. If the current system language supports GBK, no encoding problem arises at this step. On an English system, such as Linux, AIX, or Solaris with LANG set to en_US, set the JVM file.encoding value to GBK. If the system language is GB2312, decide whether to set file.encoding as needed; setting file.encoding to GBK can resolve potential garbling of GBK characters.

  b. Compilation of Java source. The Java source must be compiled into a .class file before it can run in the JVM. This step has the same file.encoding issue as a. From here on, servlets and JSPs are handled alike, except that servlet compilation is not automatic. For JSP, the generated intermediate Java file is compiled automatically (the program calls the sun.tools.javac.Main class directly). If a problem occurs at this step, check file.encoding and the OS language environment, or convert the static Chinese characters embedded in the JSP's Java code to Unicode escapes, or avoid placing static text output inside Java code altogether. For servlets, you can specify the -encoding parameter manually when compiling with javac.
  c. Output to the browser. The servlet must convert the HTML page content into an encoding the browser accepts before sending it. Depending on each Java application server's implementation, some query the browser's Accept-Charset and Accept-Language parameters or guess the encoding in other ways, and some ignore the matter. Using a fixed encoding is therefore probably the best solution. For Chinese web pages, set contentType="text/html; charset=gb2312" in the JSP or servlet. If the page contains GBK characters, set contentType="text/html; charset=GBK"; because IE and Netscape support GBK to different degrees, test this setting.
    Because the high 8 bits of a 16-bit Java char can be discarded in network transmission, and to ensure that the characters on the servlet page (both embedded ones and those obtained while the servlet runs) are in the expected internal code, use PrintWriter out = res.getWriter() instead of ServletOutputStream out = res.getOutputStream(). PrintWriter converts according to the charset specified in contentType (contentType must be set before this!). You can also wrap the ServletOutputStream in an OutputStreamWriter and use write(String) to output Chinese strings.
    For JSP, the Java application server should be able to ensure that embedded Chinese characters are transmitted correctly at this stage.
  d. Interpretation of URL character encoding. If parameter values sent back from the browser via GET/POST contain Chinese characters, the servlet cannot obtain the correct values. In Sun's J2SDK, HttpUtils.parseName does not consider the browser's language settings when parsing parameters; it parses the received values byte by byte. This is the most discussed encoding problem on the Internet. Because this is a design defect, you can only re-parse the obtained string in binary mode, or work around it by hacking the HttpUtils class. See reference article 2 for more information, but it is best to change the Chinese encodings gb2312 and cp1381 used there to GBK; otherwise problems remain when GBK characters appear.
    Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, which specifies the encoding the application expects before request.getParameter("param_name") is called.
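The output and parameter steps above can be illustrated without a servlet container. The sketch below is an assumption-laden stand-in, not the servlet API: a ByteArrayOutputStream plays the role of the ServletOutputStream, URLDecoder with ISO-8859-1 imitates a container that parses parameters byte by byte, and the names ServletEncodingSketch, render, containerDecode and redecode are hypothetical. With Servlet API 2.3 or later, calling setCharacterEncoding before getParameter would normally replace the redecode step.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ServletEncodingSketch {
    static final Charset PAGE_CHARSET = Charset.forName("GBK"); // fixed page encoding

    /** Writes text through a Writer bound to the page charset, as when wrapping an output stream. */
    static byte[] render(String html) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream(); // stand-in for the servlet stream
        try (Writer out = new OutputStreamWriter(raw, PAGE_CHARSET)) {
            out.write(html); // each char is encoded, not truncated to its low 8 bits
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.toByteArray();
    }

    /** Imitates a container that turns each parameter byte into one char. */
    static String containerDecode(String rawValue) {
        try {
            return URLDecoder.decode(rawValue, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Recovers the original bytes and decodes them with the page charset. */
    static String redecode(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), PAGE_CHARSET);
    }

    public static void main(String[] args) {
        String received = containerDecode("%B0%A1"); // browser sent U+554A as GBK bytes
        System.out.println(received.length());       // two chars, one per byte
        System.out.println(redecode(received).length());
    }
}
```

The re-decoding only works when the container's first pass preserved every byte, which is why ISO-8859-1 is used as the intermediate encoding in this sketch.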

4.3 Solutions in IBM WebSphere Application Server

WebSphere Application Server extends the standard servlet API 2.x to provide better multi-language support. When running on a Chinese operating system, it handles Chinese characters well without any configuration. The following applies only when WAS runs on an English system or when GBK support is required.

In cases c and d above, WAS has to query the browser's language settings. By default, zh and zh-cn are both mapped to the Java encoding Cp1381 (note: Cp1381 is only a code page equivalent to GB2312 and does not support GBK). I think this is because it cannot be confirmed whether the operating system running the browser supports GB2312 or GBK. In real applications, however, GBK characters still need to appear on pages; the best-known example is the character "镕" (rong2, 0xe946, \u9555) in Premier Zhu's name. You must therefore specify the encoding/charset as GBK. Of course, changing the default encoding in WAS is not as cumbersome as described above. For a and b, see reference article 5: specify -Dfile.encoding=GBK in the application server's command line arguments. For d, specify -Ddefault.client.encoding=GBK in the application server's command line arguments. If -Ddefault.client.encoding=GBK is specified, the charset no longer needs to be specified in c.
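Whether a page can stay with gb2312 or needs GBK can be checked programmatically. The following is a minimal sketch using the standard CharsetEncoder.canEncode method; the class CharsetChooser and the method pickCharset are hypothetical, and the sketch assumes the JVM provides the GB2312 charset. It does not verify that the text fits GBK either; it only detects characters outside GB2312.

```java
import java.nio.charset.Charset;

final class CharsetChooser {
    /** Returns "gb2312" if every character fits GB2312, otherwise "GBK". */
    static String pickCharset(String pageText) {
        boolean fitsGb2312 = Charset.forName("GB2312").newEncoder().canEncode(pageText);
        return fitsGb2312 ? "gb2312" : "GBK";
    }

    public static void main(String[] args) {
        // U+554A is the GB2312 character used in the earlier example.
        System.out.println("text/html; charset=" + pickCharset("\u554a"));
        // U+9555 is the character from Premier Zhu's name, cited above as GBK 0xe946.
        System.out.println("text/html; charset=" + pickCharset("\u9555"));
    }
}
```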

The problem list above also includes the one about static text inside the Java code in <%...%> and <%=...%> tags not being displayed correctly. The solution in WAS is, in addition to setting the correct file.encoding, to set -Duser.language=zh -Duser.region=CN in the same way. This is related to the Java locale setting.
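To confirm that such command line settings actually reached the JVM, the relevant system properties can be printed from inside the application server. This is a small sketch with a hypothetical class name, JvmEncodingReport; it reads only standard Java properties. The WAS-specific default.client.encoding property is not interpreted here, and on some JVMs user.region may print as null because user.country is used instead.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class JvmEncodingReport {
    /** Collects the encoding and locale settings the JVM is currently running with. */
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name()
                + ", user.language=" + System.getProperty("user.language")
                + ", user.region=" + System.getProperty("user.region")
                + ", locale=" + Locale.getDefault();
    }

    public static void main(String[] args) {
        // Run, for example, with: java -Dfile.encoding=GBK -Duser.language=zh JvmEncodingReport
        System.out.println(report());
    }
}
```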

4.4 Encoding during database read/write

Another place where encoding problems often occur in JSP/servlet programming is reading and writing data in a database.

Popular relational database systems support a database encoding; that is, when creating a database you can specify its character set, and the data is stored in that encoding. When an application accesses the data, an encoding conversion takes place on the way in and on the way out. For Chinese data, the database encoding should be chosen to preserve data integrity. GB2312, GBK, and UTF-8 are all valid choices. You can also choose ISO8859-1 (8-bit), but then the application must split each 16-bit Chinese or Unicode character into two 8-bit characters before writing, and after reading it must recombine the two bytes and also tell SBCS characters apart. This leaves the database's encoding support unused and adds programming complexity, so ISO8859-1 is not a recommended database encoding. When programming JSP/servlets, first use the management tools provided by the database management system to check whether the Chinese data is correct.
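The extra work that an ISO8859-1 database imposes on the application can be sketched as a pair of helper methods. This is an illustration only, with hypothetical names (Latin1StorageSketch, toLatin1Column, fromLatin1Column) and no real database access; it assumes GBK as the byte encoding of the Chinese text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1StorageSketch {
    static final Charset GBK = Charset.forName("GBK"); // assumption: available in this JVM

    /** Before writing: splits each character into its GBK bytes, one 8-bit char per byte. */
    static String toLatin1Column(String text) {
        return new String(text.getBytes(GBK), StandardCharsets.ISO_8859_1);
    }

    /** After reading: recombines the 8-bit chars into bytes and decodes them as GBK. */
    static String fromLatin1Column(String stored) {
        return new String(stored.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    public static void main(String[] args) {
        String original = "a\u554a";              // one SBCS character, one Chinese character
        String stored = toLatin1Column(original);
        System.out.println(original.length());    // length seen by the application
        System.out.println(stored.length());      // length seen by the database
        System.out.println(fromLatin1Column(stored).equals(original));
    }
}
```

The differing lengths show why column sizes, sorting, and substring operations on the database side no longer match what the application sees.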

Note that data read from the database is generally Unicode once it is inside a Java program; the conversion runs in the opposite direction when data is written.

4.5 Frequently used troubleshooting techniques

The crudest but most effective way to locate a Chinese encoding problem is to print the string's internal code after each step you suspect. By printing the internal code you can find out when Chinese characters are converted to Unicode, when Unicode is converted back to Chinese characters, when one Chinese character turns into two Unicode characters, when a Chinese string turns into a string of question marks, and when the high byte of a Chinese string is truncated.

Choosing a suitable sample string also helps to distinguish the type of problem, for example a string that mixes English letters with GB (GB2312) and GBK Chinese characters. In general, English characters are not distorted no matter how they are converted or processed (if they are, try increasing the number of consecutive English letters).
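A helper for printing a string's internal code can be as small as the sketch below. The class StringDump and the method codeUnits are hypothetical names; the output format (four hex digits per UTF-16 code unit) is one possible choice, not a convention from the original article.

```java
final class StringDump {
    /** Renders each UTF-16 code unit of a string as four hex digits, separated by spaces. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly decoded Chinese character shows up as one unit such as 554a.
        System.out.println(codeUnits("A\u554a"));
        // The same character decoded byte by byte shows up as two units below 0100.
        System.out.println(codeUnits("\u00b0\u00a1"));
        // A lost character shows up as 003f (question mark) or fffd (replacement character).
        System.out.println(codeUnits("?\ufffd"));
    }
}
```

Calling codeUnits on the same value after each suspected step (request parsing, database read, output) shows at which step the pattern changes.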

5. Conclusion

In fact, Chinese encoding in JSP/servlets is not as complicated as it seems. Although there is no fixed recipe for locating and solving problems, and runtime environments vary, the underlying principle is the same. Understanding character sets is the basis for solving character problems. However, as Chinese character sets evolve, Chinese information processing problems will persist for some time, and not only in Java programming.
