From Chinese to International

Source: Internet
Author: User
Tags character set i18n interface locale linux
Summary:
1 Following the Java internationalization design framework: how to use Linux system locale settings so that Java applications support Chinese
2 Following the Java web application (WebApp) design framework: how to solve the problems related to the URLEncoder.encode() method and the system default encoding through web.xml settings
3 Taking Google's search engine as an example: how to apply internationalization and localization in your application design (Unicode inside, locale outside)

Supporting Chinese in Java applications through Linux locale settings

The article "Analysis and solution of Chinese character problem in Java programming technology" is very good, and until recently it was still frequently reposted by websites. It includes an example of how many Chinese programmers think when they run into garbled Chinese characters: "GB2312 it" (Chinese localization).

The original text reads as follows:
>>>>>>>
...... Not long ago, a technical friend of mine wrote to tell me that he had finally found the root of the Java Servlet Chinese problem. For two weeks he had been plagued by a Chinese problem with Java Servlets, because every string containing Chinese characters had to be explicitly converted to get the correct result (this seemed to be the only solution anyone knew). Eventually he did not want to go on like this, because such work really should not be the job of a senior programmer, so he dug out the Servlet decoding source code and analyzed it, suspecting that the problem lay in the decoding part. After four hours of struggle, he finally found the root of the problem. His suspicion was correct: the Servlet's decoding code does not consider double-byte characters at all and treats each %xx directly as a single character. (So even JavaSoft makes this kind of low-level mistake!)

If you are interested in this issue or are experiencing the same problem, you can follow his steps to modify servlet.jar:

Locate the static private String parseName method in the HttpUtils source code. Before it returns, copy sb (the StringBuffer) into a byte array bs[], and then return new String(bs, "GB2312"). After making this change, you need to do the decoding yourself:

Hashtable form = HttpUtils.parseQueryString(request.getQueryString()) or

form = HttpUtils.parsePostData(...)

Do not forget to put the class back into servlet.jar after compiling it.
......
<<<<<<<<<

A few questions for these "senior" programmers:
1 If this is a commercial product, must the customer run your hacked servlet.jar in order to use the application?
2 Can the product only be used with Chinese GB2312? What about a Japanese application: do you hack it again in the same way?
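
For reference, the per-string conversion that the quoted author was tired of writing typically looks like the sketch below. It assumes the container decoded each %xx escape as a single ISO-8859-1 character, as the quote describes. The class and method names are hypothetical, and the sample value is an assumed input: the GB2312 bytes of two Chinese characters (U+4E2D U+6587).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class RecodeSketch {
    // Mimics the behavior described in the quote: every %xx escape becomes
    // one character, which is equivalent to decoding the bytes as ISO-8859-1.
    static String decodeLikeOldServlet(String encoded) {
        try {
            return URLDecoder.decode(encoded, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // The per-string workaround: recover the raw bytes, then decode them as GB2312.
    static String recodeAsGb2312(String misdecoded) {
        try {
            return new String(misdecoded.getBytes("ISO-8859-1"), "GB2312");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = decodeLikeOldServlet("%D6%D0%CE%C4");
        System.out.println(raw.length());                 // 4 separate characters
        System.out.println(recodeAsGb2312(raw).length()); // 2 Chinese characters
    }
}
```

The conversion only works when the hard-coded charset matches what the client actually sent, which is the limitation the two questions above point at.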

Maybe I am wrong, but my feeling is that the fault does not lie with JavaSoft: Java application localization is not implemented at the web application level. Instead, the JVM's system default encoding changes according to the operating system's environment settings (locale). At the end of 2000, Linux locale support for Chinese was still limited, so on Linux it was not possible to change the JVM's default encoding to GB2312 through locale settings.
For Linux support for l10n, see: Linux programmers must read: Chinese culture and GB18030 standards

How do you set up Linux to support Chinese encoding at the system level?

Under Red Hat 6.x, no matter how you set the locale, the system default file.encoding is always ISO-8859-1, because Red Hat 6.2 is based on glibc-2.1.x. Red Hat 7.x is based on glibc-2.2.x, which has more complete support for l10n, so you can set
LC_ALL=zh_CN.GB2312; export LC_ALL
LANG=zh_CN.GB2312; export LANG
to make the system default encoding GB2312, GBK, and so on. This changes the default encoding of the JVM (file.encoding), and from then on the JVM performs every conversion from a byte stream to a character stream according to the system default encoding.

On Linux based on glibc-2.2, you can change the system default encoding through locale settings, and thereby change the default encoding and decoding behavior of the application.
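
To check what the JVM actually picked up from the locale, a small program along the following lines can be run under different LC_ALL/LANG values. This is only a sketch: the class name is hypothetical, and on newer JDK releases (18 and later) the default charset is UTF-8 regardless of the locale unless configured otherwise, so the effect described above is easiest to observe on older JVMs.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class DefaultEncodingReport {
    // Collects the settings the JVM derived from the environment at startup.
    static String describe() {
        return "locale=" + Locale.getDefault()
                + " file.encoding=" + System.getProperty("file.encoding")
                + " defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

For example, it could be started as LC_ALL=zh_CN.GB2312 java DefaultEncodingReport and again with a different locale to compare the reported values.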

Two points I would like to make here:

1 To be fair to commercial operating systems: Linux's support for internationalization lags behind commercial operating systems such as Windows and Solaris by two years or more.
2 Linux is built on GNU tools: without GNU there would be no Linux. Linux support for localization therefore only developed gradually after the core glibc-2.2.x library gained better support for Chinese locales.

Resolving problems related to the URLEncoder.encode() method and the system default encoding through web.xml settings

As far as I understand, the behavior of URLEncoder in JDK 1.3 is quite inconsistent with the Java internationalization specification:
For example, on Chinese Windows 98, calling URLEncoder.encode(String s) directly on the two Chinese characters for "Chinese" yields "%3f%3f", that is, "??". The reason is simple: to be URL-encoded correctly, the two characters must first be converted to their 4 GBK bytes inside encode(). This was corrected in JDK 1.4: encode(String s) is deprecated and replaced by encode(String s, String enc), which takes the encoding to use in addition to the string to be URL-encoded. In this way, URLEncoder can work independently of the system default encoding.
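
The difference can be reproduced with the two-argument encode method. The sketch below is illustrative only: the class name and the wrapper method are hypothetical, and the sample text (U+4E2D U+6587) is assumed to be the two characters the example refers to. Passing ISO-8859-1 imitates a JVM whose default encoding cannot represent Chinese.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlEncodeSketch {
    // Thin wrapper around URLEncoder.encode(String, String) that turns the
    // checked exception into an unchecked one for brevity.
    static String encode(String text, String enc) {
        try {
            return URLEncoder.encode(text, enc);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + enc, e);
        }
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two Chinese characters
        System.out.println(encode(chinese, "GBK"));        // %D6%D0%CE%C4
        System.out.println(encode(chinese, "ISO-8859-1")); // %3F%3F
        System.out.println(encode(chinese, "UTF-8"));      // %E4%B8%AD%E6%96%87
    }
}
```

Because the encoding is an explicit argument, the result no longer depends on the platform the code happens to run on.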

In JDK 1.3, for an application based on the web-app framework, this problem can be solved with a setting in WEB-INF/web.xml:
<web-app character-encoding="Your_system_default_file.encoding">
...
</web-app>


If the product runs on Chinese Windows 98, where the default character set is GBK, the application's web.xml needs to be set to:
<web-app character-encoding="GBK">
...
</web-app>

Unicode Inside, Locale Outside

The two methods above only make the application easier to localize; the application itself is still not truly internationalized. Imagine designing a global forum system that lets both Chinese and Japanese users browse and post with ease. In which character set should the data be stored? The answer is simple: Unicode. Finally, I use Google's multilingual search engine as a design example to illustrate how to design an internationalized application. Google is a very good example of an internationalized application (though I am not saying that Google is written in Java).

Google users often wonder:
Why is the Chinese interface shown the first time I visit Google?
Why does a search for Chinese across all websites sometimes return Japanese websites? For example: "Google secret"

Take the "Google secret" query as an example: we enter "Google secret" in the input box:
http://www.google.com/search?hl=zh-CN&newwindow=1&q=google+%C3%D8%C3%DC&btnG=Google%CB%D1%CB%F7&lr=

The process can be described simply as follows:
Input: query (in the client's encoding) => Google (decodes the input bytes into Unicode) => query the Unicode index => Unicode result set => Output: query results (encoded as a byte stream in the client's encoding)

Specific Description:

How Google identifies the "interface language" used by the browser: when Google receives this query string, it generally determines the character set encoding used by the client from the hl=zh-CN parameter. On a user's first visit, Google decides based on the Accept-Language: zh-CN header contained in the request sent by the browser, which is why Google can automatically identify the language for many users on their first visit. In subsequent queries and page turns, this parameter is saved in a cookie and passed to Google via GET (you can also set the interface language in Preferences), so that the client's encoding can be identified reliably.
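
The selection order described above (explicit hl parameter first, then the Accept-Language header) can be sketched with the standard java.util.Locale API. This is not Google's implementation: the class name, the list of supported interface languages, and the English fallback are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class InterfaceLanguageSketch {
    // Hypothetical list of interface languages the application ships with.
    static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.forLanguageTag("en"),
            Locale.forLanguageTag("zh-CN"),
            Locale.forLanguageTag("ja"));

    // Prefer the explicit hl parameter, then the Accept-Language header,
    // and fall back to English when neither yields a supported language.
    static Locale pick(String hlParam, String acceptLanguage) {
        if (hlParam != null && !hlParam.isEmpty()) {
            Locale requested = Locale.forLanguageTag(hlParam);
            if (SUPPORTED.contains(requested)) {
                return requested;
            }
        }
        if (acceptLanguage != null && !acceptLanguage.isEmpty()) {
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
            Locale match = Locale.lookup(ranges, SUPPORTED);
            if (match != null) {
                return match;
            }
        }
        return Locale.ENGLISH;
    }

    public static void main(String[] args) {
        // First visit: no hl parameter yet, only the browser header.
        System.out.println(pick(null, "zh-CN,zh;q=0.9,en;q=0.8").toLanguageTag());
    }
}
```

A real application would also persist the chosen value, for example in a cookie, as the paragraph above describes.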
How Google queries: as you can see from the URL, the "secret" in the query is actually sent as %C3%D8%C3%DC, that is, the two characters of "secret" encoded in GBK (the default encoding of a Chinese Windows client) as 4 bytes and then URL-encoded (see "How to encode Chinese characters" in the references). Google decodes the query string accordingly and turns it into Unicode. The internal query is then carried out with this Unicode string. Pages in any language are first converted to Unicode and stored in Google's index. In Unicode, characters that are written the same way in Japanese and Chinese share the same code. So if you do not specify a language filter, a Japanese page may be the first hit: for a query from a Chinese client, characters that map to the same Unicode code points also match the corresponding Japanese pages, Traditional Chinese pages, and so on. Google's query results are likewise Unicode at first; finally the Unicode results are converted into a byte stream in the client's encoding and returned to the client.
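
The decoding step for the q parameter of the example URL can be tried out as follows. This is a sketch under the assumption that the client encoding has already been identified as GBK; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryParamSketch {
    // Decode a raw query value using the encoding the client is known to use.
    static String decode(String rawValue, String clientEncoding) {
        try {
            return URLDecoder.decode(rawValue, clientEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Re-encode Unicode text for a client that uses a different encoding.
    static String encode(String text, String targetEncoding) {
        try {
            return URLEncoder.encode(text, targetEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String q = decode("google+%C3%D8%C3%DC", "GBK");
        System.out.println(q.length());         // 9: "google", a space, two Chinese characters
        System.out.println(encode(q, "UTF-8")); // google+%E7%A7%98%E5%AF%86
    }
}
```

Once decoded, the query is an ordinary Java String (Unicode), so the same text can be matched against an index regardless of which encoding the client used to send it.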
From the above analysis we can see that Unicode solves the internationalization problem of applications very elegantly:
1 Data is stored centrally in Unicode, which can be converted to any character set (Unicode inside)
2 The data is then converted to the appropriate character set according to the client's locale (locale outside)
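
Applied to the forum example mentioned earlier, the two points above might be sketched as follows. The class is hypothetical and keeps posts in memory only; a real system would use a database configured for Unicode storage.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class ForumStoreSketch {
    // Posts are kept as Java Strings, that is, as Unicode text (Unicode inside).
    private final List<String> posts = new ArrayList<>();

    // Decode the submitted bytes using the charset of the posting client.
    void submit(byte[] body, String clientCharset) {
        posts.add(new String(body, Charset.forName(clientCharset)));
    }

    // Encode a stored post in the charset of the reading client (locale outside).
    byte[] render(int index, String clientCharset) {
        return posts.get(index).getBytes(Charset.forName(clientCharset));
    }

    // Expose the stored Unicode text, for example for indexing or searching.
    String text(int index) {
        return posts.get(index);
    }

    public static void main(String[] args) {
        ForumStoreSketch store = new ForumStoreSketch();
        // A GBK client posts two Chinese characters (U+4E2D U+6587).
        store.submit(new byte[] {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4}, "GBK");
        // A UTF-8 client reads the same post: 3 bytes per character.
        System.out.println(store.render(0, "UTF-8").length); // 6
    }
}
```

A client using another charset, for instance a Japanese one, would call render with its own charset name; characters that the target charset cannot represent would be replaced, which is one reason to prefer a Unicode encoding on the wire where possible.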

If the earlier "Chinese localization" style of application design is comparable to add-on Chinese systems such as UCDOS and RichWin, then sooner or later it will be eliminated, just as they were by Chinese Windows 95 with Chinese support built into its core. After all, core-level support for internationalization is the truly general solution that simplifies application design. Many Microsoft and Sun products were designed for the world market from the outset, whereas the "Chinese localization" mentality stems from a small-farmer attitude in our software development of being content with self-sufficiency.

Reference Documentation:
The internationalization design of Java
http://java.sun.com/docs/books/tutorial/i18n/index.html

Linux internationalization localization and Chinese culture
http://www.linuxforum.net/doc/i18n-new.html

Linux programmers must read: Chinese culture and GB18030 standards
http://www.ccidnet.com/tech/os/2001/07/31/58_2811.html

Unicode FAQ
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.linuxforum.net/books/UTF-8-Unicode.html (Chinese version)

Analysis and solution of Chinese character problem in Java programming technology
http://www-900.ibm.com/developerWorks/java/java_chinese/index.shtml

How to encode Chinese characters:
http://www.unihan.com.cn/cjk/ana17.htm

* Note: l10n and i18n are abbreviations formed from the first and last letters of the English word with the number of letters in between.
l10n: localization
i18n: internationalization
