Java character encoding fundamentals

Garbled text is a common problem in Java development, and when it appears, people are often reluctant to admit that their own code is at fault. In fact, encoding problems are neither mysterious nor unpredictable once you understand how Java handles encoding at each stage. First, look at the figure:

Encoding problems fall into two areas: inside the JVM and outside the JVM.

1. Compiling the Java file into a class

The source file may be saved in any encoding. The compiler does not detect that encoding by itself: javac decodes the file using the encoding given by the -encoding option or, if none is given, the platform default. As long as that matches the file's real encoding, string literals end up in the class file in a Unicode form (the class file format stores them as modified UTF-8, and the JVM exposes them as UTF-16 strings). So if the Java code defines a string:

    String s = "Chinese character";

then no matter which encoding the source file used before compilation, the compiled class contains the same Unicode representation.

2. JVM encoding

When the JVM loads the class file, it reads the string constants as Unicode, so the string s defined above is held in memory as Unicode (UTF-16). The root cause of garbled text is planted when you call String.getBytes() with no arguments: this method uses the platform's default character set to produce the byte array for the string. On the Chinese version of Windows XP, the default encoding is GBK. If you do not believe it, run:

    import java.nio.charset.Charset;

    public class Test {
        public static void main(String[] args) {
            // Print the JRE version and the default charset that String.getBytes() relies on
            System.out.println("Current JRE: " + System.getProperty("java.version"));
            System.out.println("Default JVM Character Set: " + Charset.defaultCharset());
        }
    }

Output:

Current JRE: 1.6.0_16
Default JVM Character Set: GBK

When strings are encoded and decoded repeatedly as they pass between different systems and databases, garbled text appears easily if you do not understand these principles. A system therefore needs to unify its string encoding. Put loosely, this means agreeing on one encoding wherever strings enter or leave the program, for example string parameters of methods and I/O streams. In a Chinese system, GBK, GB18030, UTF-8, UTF-16, and so on are all workable, but choose a character set large enough that every character you may need can be represented. With a character set that is too small (suppose every file used ASCII), two-way conversion is impossible, because characters outside the set are lost.

Note that UTF-8 and GB18030 can both represent every Unicode character, so text converts between them without loss. GBK covers fewer characters, so converting to GBK can lose characters that GBK lacks. In practice, garbled text comes far more often from mixing encodings than from the encoding chosen. A typical case is a system built by several people in which some source files are saved as GBK, some as UTF-8, and others as GB18030. Save all source files in one encoding, and the Ant script will no longer report unrecognizable characters during compilation. For a purely Chinese system, it is simplest to standardize on GBK or GB18030 (GBK is in fact a subset of GB18030) rather than mix them with UTF-8.

3. Encoding of strings in memory

Strings in memory do not come only from the class file. Some are read from text files, some come from a database, and some are constructed from byte arrays. These external sources are usually not stored as Unicode (UTF-16), mainly to save storage space, so a program has to deal with a variety of encodings.
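As a rough illustration of why the encoding of an external source matters, the sketch below decodes the same bytes once with the charset that produced them and once with a different one. The class name EncodingMismatchDemo, the helper roundTrips, and the sample text are made up for this example. It uses only UTF-8 and ISO-8859-1, which every Java runtime is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingMismatchDemo {
    // Hypothetical helper: encode text with one charset, decode the bytes with
    // another, and report whether the original text survives.
    static boolean roundTrips(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith).equals(text);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the encoding of this
        // source file does not affect the result.
        String original = "\u4e2d\u6587";

        // Bytes as they might arrive from a file or database that stores UTF-8.
        byte[] external = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(external)); // [-28, -72, -83, -26, -106, -121]

        // Decoding with the wrong charset turns each byte into a separate character.
        String wrong = new String(external, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 6

        System.out.println(roundTrips(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8));      // true
        System.out.println(roundTrips(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)); // false
    }
}
```

The bytes themselves are never damaged in this sketch. Only the interpretation changes, which is why knowing the source encoding is enough to read the text correctly.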
Before processing a string, identify the encoding of its source, and use that encoding to read the data into memory correctly. If the string is a method parameter, the encoding of that parameter must be specified as well, because it may have been passed in from another system, for example a Japanese one. Once the encoding is known, the string can be processed as required without producing garbled text.

When encoding or decoding a string, call the following methods:

    getBytes(String charsetName)
    String(byte[] bytes, String charsetName)

Do not use the overloads without a character set name. With these two methods you can re-encode the characters held in memory.
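A minimal sketch of re-encoding with the two methods named above might look like the following. The class name ExplicitCharsetDemo and the helper transcode are hypothetical, and the example assumes the runtime ships the GBK charset, which full JDK distributions normally do.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharsetDemo {
    // Hypothetical helper: decode the bytes with the source charset, then
    // encode the resulting string with the target charset.
    static byte[] transcode(byte[] input, String fromCharset, String toCharset) {
        try {
            String text = new String(input, fromCharset); // String(byte[] bytes, String charsetName)
            return text.getBytes(toCharset);              // getBytes(String charsetName)
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset name", e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the encoding of this
        // source file does not affect the result.
        String text = "\u6c49\u5b57";

        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8); // 3 bytes per character
        byte[] gbk = transcode(utf8, "UTF-8", "GBK");         // 2 bytes per character
        byte[] back = transcode(gbk, "GBK", "UTF-8");

        System.out.println(utf8.length + " " + gbk.length); // 6 4
        System.out.println(Arrays.equals(utf8, back));       // true
    }
}
```

The conversion always passes through a String, so neither side depends on the platform default charset. Since Java 7 the overloads that take a Charset object, such as StandardCharsets.UTF_8, are an alternative that avoids the checked UnsupportedEncodingException.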

This article is from the "melyan" blog, please be sure to keep this source http://lavasoft.blog.51cto.com/62575/273608
