From Chinese to International

Source: Internet
Author: User

Reference: From Chinese to International
http://www.chedong.com/tech/unicode_java.html

I. Java first performs an encoding/decoding step between the "byte stream" and the "character stream" on input and output. By default, the charset used for this step is taken from the system configuration. So why are character set problems rare in PHP and similar environments, while Java, which has a good internationalization mechanism, so often produces garbled characters?

Simple Example:
Take a text file containing the two Chinese characters "你好" ("hello"). On disk it is actually 4 bytes: C4 E3 BA C3.
On an English operating system, the default encoding is ISO-8859-1, so reading the file directly yields 4 bytes, and after decoding as ISO-8859-1 the program works with 4 Java characters. Although each Java character is a 16-bit Unicode value, each one still corresponds to a single 8-bit byte (\u00c4 \u00e3 \u00ba \u00c3), so the processing is still "English", that is, byte-oriented. On output, the four characters are encoded back to the same four bytes, and the browser decodes that byte stream with a Chinese encoding and displays the correct Chinese characters.

On an operating system whose default encoding is GBK, a Java application reading the same file gets the same four bytes, but decodes every two bytes as one GBK character, producing two 16-bit Java characters: \u4f60 \u597d. Each one is the code point of the corresponding Chinese character in the CJK block of Unicode.
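
The two cases above can be reproduced with a short sketch. It decodes the same four bytes once as ISO-8859-1 and once as GBK and prints the resulting Java characters as escapes. The class and method names are made up for illustration, and the sketch assumes the JDK in use ships the GBK charset (full JDKs normally do; trimmed-down runtimes may not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // The four bytes of the sample file from the text: C4 E3 BA C3.
    static final byte[] FILE_BYTES = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};

    // Decode raw bytes into Java characters using an explicit charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Render every UTF-16 code unit as a four-digit hex escape so the
    // result is readable on any console, whatever its own encoding.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the GBK charset is available in this runtime.
        Charset gbk = Charset.forName("GBK");

        // "English" system: one char per byte, four chars in total.
        System.out.println(escape(decode(FILE_BYTES, StandardCharsets.ISO_8859_1)));

        // GBK system: two bytes per char, two chars in total.
        System.out.println(escape(decode(FILE_BYTES, gbk)));
    }
}
```

The first line printed lists four code units in the Latin-1 range, the second lists the two CJK code points, which is exactly the difference described above.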

This explains why character set problems are rare in PHP and similar environments but common in Java:
(1) The default locale of a server environment is generally English (ISO-8859-1), which amounts to treating all data as bytes. The encoding never changes between input and output, so garbled characters are rare.
(2) Java actually provides a mechanism that treats each Chinese character as one character rather than two bytes. Most garbling is caused by using inconsistent charsets for decoding on input and encoding on output. Furthermore, thanks to Unicode, a program can go beyond localizing its interface: the content it processes can be shared across operating systems with different character sets. For example, content edited on a Traditional Chinese operating system can be queried on a Simplified Chinese one.
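
As a minimal sketch of the "inconsistent encoding and decoding" problem, the following code writes text as GBK and reads it back with the wrong charset. It also shows a commonly used repair that works only because ISO-8859-1 maps every byte to exactly one character and back. The class and method names are hypothetical, and GBK availability in the runtime is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MismatchDemo {
    // Assumption: the GBK charset is available in this runtime.
    static final Charset GBK = Charset.forName("GBK");

    // Simulates GBK data being read on a system that decodes as ISO-8859-1.
    static String readWithWrongCharset(String original) {
        byte[] onDisk = original.getBytes(GBK);
        return new String(onDisk, StandardCharsets.ISO_8859_1);
    }

    // Simulates the same GBK data being decoded as UTF-8. Invalid byte
    // sequences are replaced, so the original bytes cannot be recovered.
    static String readAsUtf8(String original) {
        return new String(original.getBytes(GBK), StandardCharsets.UTF_8);
    }

    // Undoes the ISO-8859-1 mistake: re-encoding restores the original
    // bytes, which can then be decoded with the charset they were written in.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, GBK);
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d"; // the two-character greeting from the example file
        String garbled = readWithWrongCharset(original);

        System.out.println("chars after wrong decoding: " + garbled.length());
        System.out.println("repaired equals original: " + repair(garbled).equals(original));
        System.out.println("UTF-8 misread is lossy: " + (readAsUtf8(original).indexOf('\uFFFD') >= 0));
    }
}
```

The repair is a workaround, not a design: the cleaner fix is to name the correct charset at the point where bytes first become characters.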

II. From Chinese to International

Chinese localization: the product runs on Windows 98, where the system default character set is GBK. The web.xml of the application must therefore be set to:
<web-app character-encoding="GBK">
...
</web-app>

Settings of this kind only localize the application; the application itself is still not truly international.

Imagine designing a global forum system that both Chinese and Japanese users can browse and post to. Which character set should be used in the intermediate data processing and storage stages? The answer is simple: Unicode. Many articles have explained how to design an internationalized interface, but they cover only the localized interface output of the application and seldom mention how to make the data itself international during intermediate processing.

Internationalization:

In the input and storage phases, data is processed and stored as Unicode, which makes the application easy to internationalize. Google is a very good example of an international application, so I will use the Google search engine's international support to illustrate how to design one.

Google users often wonder:

  1. Why did the Chinese interface appear the first time I went to Google?
  2. Why does searching all websites for Chinese characters sometimes return Japanese pages? Take the query "Google 秘密" ("Google secret") as an example. Entering it in the search box produces a request like:
    http://www.google.com/search?hl=zh-CN&newwindow=1&q=Google+%C3%D8%C3%DC&btnG=Google%CB%D1%CB%F7&lr=

First, here is a brief description of how Google processes a query:

  1. The user types the query in the client browser.
  2. The browser converts the query string to a byte stream using the client system's encoding (GBK), URL-encodes it, and sends it to Google.
  3. Google URL-decodes the input, then decodes the resulting byte string to Unicode according to the client's encoding.
  4. Matching is done entirely in Unicode. For example, the two characters of "中文" ("Chinese") exist in Simplified Chinese, Traditional Chinese, and Japanese alike, so a page containing them can match no matter which of these languages it is written in.
  5. Result output: the result set (Unicode) is encoded into a byte stream using the client's encoding (GBK) and returned to the browser.
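
Steps 3 to 5 can be sketched in a few lines of Java. This is only an illustration of the decode, match, encode flow described above, not Google's actual implementation: the class, the in-memory page list, and the substring match are all stand-ins, and GBK availability in the runtime is assumed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class QueryPipeline {
    // Step 3: URL-decode the query and turn its bytes into Unicode,
    // using the charset the client is known to use.
    static String decodeQuery(String rawQuery, String clientCharset) {
        try {
            return URLDecoder.decode(rawQuery, clientCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + clientCharset, e);
        }
    }

    // Step 4: matching works purely on Unicode strings, so the language
    // or original encoding of a page plays no role here.
    static List<String> search(List<String> pages, String term) {
        List<String> hits = new ArrayList<>();
        for (String page : pages) {
            if (page.contains(term)) {
                hits.add(page);
            }
        }
        return hits;
    }

    // Step 5: convert back to the client's charset only at the very end.
    static byte[] encodeForClient(String text, String clientCharset) {
        try {
            return text.getBytes(clientCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + clientCharset, e);
        }
    }

    public static void main(String[] args) {
        // The q parameter from the example URL, as sent by a GBK client.
        String query = decodeQuery("Google+%C3%D8%C3%DC", "GBK");
        String term = query.split(" ")[1];

        // Hypothetical pages, already converted to Unicode when indexed:
        // a Simplified Chinese, a Traditional Chinese and a Japanese title.
        List<String> pages = Arrays.asList(
                "\u79d8\u5bc6\u82b1\u56ed",
                "\u79d8\u5bc6\u82b1\u5712",
                "\u79d8\u5bc6\u306e\u82b1\u5712",
                "unrelated page");

        List<String> hits = search(pages, term);
        System.out.println("hits: " + hits.size());
        for (String hit : hits) {
            System.out.println(encodeForClient(hit, "GBK").length + " bytes sent to client");
        }
    }
}
```

All three CJK titles match because the two query characters have the same code points in each of them, which is the effect step 4 describes.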

Details:

  • How Google identifies the browser's "interface language": when Google receives a query, it generally learns the client's language, and from it the character encoding to use, from the hl=zh-CN parameter. On a user's first visit there is no such parameter, so Google decides from the Accept-Language: zh-cn header that the browser sends with the request; this is why Google picks the right language for many users on their first visit. The setting is then saved in a cookie and is also passed to Google as a GET parameter on every subsequent query and page turn (you can change it through the interface language preference), so the client's encoding is identified reliably.
  • Google query: as the URL shows, the query "秘密" ("secret") is passed in as %C3%D8%C3%DC, that is, the 4 bytes of its GBK encoding (the default on a Chinese Windows client) after URL encoding (for details on Chinese encodings, see: Chinese character encoding method). Google URL-decodes the query string, converts it to Unicode, and uses the Unicode string for the internal query. Pages in any language are likewise converted to Unicode before they are stored in Google's index. In Unicode, a Han character that is written the same way in Chinese and Japanese has the same code point. So if no language filter is specified, a query from a Chinese client also hits Japanese and Traditional Chinese pages whenever the characters coincide. The query results are produced in Unicode first, and only at the end are they converted into a byte stream in the client's encoding and returned to the client.
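
The language detection described in the first bullet might look roughly like this: an explicit hl value wins, otherwise the Accept-Language header is consulted, otherwise a default is used. The list of supported languages, the English default, and the class and method names are assumptions made for this sketch; only the `java.util.Locale` calls are standard library API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class InterfaceLanguage {
    // Hypothetical set of interface languages offered by the site.
    static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.forLanguageTag("zh-CN"),
            Locale.forLanguageTag("zh-TW"),
            Locale.forLanguageTag("ja"),
            Locale.ENGLISH);

    // hlParam: value of the hl query parameter, or null if absent.
    // acceptLanguage: value of the Accept-Language header, or null if absent.
    static Locale choose(String hlParam, String acceptLanguage) {
        // An explicit choice carried in the URL takes priority.
        if (hlParam != null && !hlParam.isEmpty()) {
            Locale requested = Locale.forLanguageTag(hlParam);
            if (SUPPORTED.contains(requested)) {
                return requested;
            }
        }
        // First visit: fall back to what the browser announces.
        if (acceptLanguage != null && !acceptLanguage.isEmpty()) {
            try {
                List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
                Locale match = Locale.lookup(ranges, SUPPORTED);
                if (match != null) {
                    return match;
                }
            } catch (IllegalArgumentException e) {
                // Malformed header: ignore it and use the default below.
            }
        }
        return Locale.ENGLISH; // assumed default for this sketch
    }

    public static void main(String[] args) {
        System.out.println(choose(null, "zh-cn,zh;q=0.8,en;q=0.5").toLanguageTag());
        System.out.println(choose("ja", "zh-cn").toLanguageTag());
        System.out.println(choose(null, null).toLanguageTag());
    }
}
```

Once a language has been chosen, a server would typically remember it (the text mentions cookies and the hl parameter) and derive the output charset from it.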

From the analysis above, we can see that Unicode solves the internationalization problem of applications very elegantly. What remains for the application front end is localization based on the local encoding environment:

  1. On input, data is first converted to Unicode; it is then processed and stored as Unicode (Unicode inside).
  2. On output, Unicode data is converted to the local character set only at the final step, when it is sent to the client, according to the client's locale settings, together with a localized interface (localization outside).
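
In Java, "Unicode inside, localization outside" comes down to naming the charset explicitly at both I/O boundaries instead of relying on the system default. The following is a minimal sketch under that assumption; the class name and the byte-array convenience method are illustrative only, and GBK availability in the runtime is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcoder {
    // Bytes are decoded to Unicode on the way in and encoded to the
    // target charset only on the way out.
    static void transcode(InputStream in, Charset inputCharset,
                          OutputStream out, Charset outputCharset) throws IOException {
        Reader reader = new InputStreamReader(in, inputCharset);
        Writer writer = new OutputStreamWriter(out, outputCharset);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            // At this point the data is Unicode (UTF-16 code units);
            // any processing or storage logic would go here.
            writer.write(buffer, 0, n);
        }
        writer.flush();
    }

    // Convenience wrapper for in-memory data.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(input), from, out, to);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The GBK sample bytes from the first section, re-encoded as UTF-8.
        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        byte[] utf8Bytes = transcode(gbkBytes, Charset.forName("GBK"), StandardCharsets.UTF_8);
        System.out.println(gbkBytes.length + " GBK bytes -> " + utf8Bytes.length + " UTF-8 bytes");
    }
}
```

Because the middle of the pipeline only ever sees characters, the same code serves a GBK client, a Big5 client, or a UTF-8 client by changing nothing but the two charset arguments.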

If application development is content with the domestic market alone, a large number of "Chinese localization" approaches will naturally appear. But if we compare such localization to UCDOS and RichWin, it will sooner or later be displaced, just as they were by Win95 with its kernel-level Chinese support. After all, core-level support for internationalization is what truly simplifies front-end application design and provides a general solution. Microsoft and Sun designed for the global market from the very beginning and therefore attached great importance to international support. By contrast, the domestic software industry pays noticeably less attention to the relevant international standards and seldom takes an active part in formulating them.
