Java encoding and character (2)

Last Update:2015-09-20 Source: Internet

Author: User


Reprinted from: http://lavasoft.blog.51cto.com/62575/273608/

Garbled text is a common problem in Java development. When it appears it is very frustrating, and nobody wants to admit that their own code is at fault. In fact, encoding problems are neither mysterious nor elusive: once you understand how Java handles encodings, the cause becomes clear. There are two sides to the encoding problem: inside the JVM and outside the JVM.

1. Java files compiled into class files

A Java source file may be saved in any of a variety of encodings. The compiler decodes it with the encoding named by the -encoding option or, if none is given, with its default encoding, so the source is read correctly only when that encoding matches the file. The resulting class file always stores strings in a Unicode form (string constants are held in a modified UTF-8 format, and the JVM works with them as UTF-16). Therefore, for a string defined in Java code such as String s = "汉字"; it does not matter which encoding the source file used before compilation: once compiled into a class file, the result is the same Unicode data.

2. Encoding in the JVM

When the JVM loads a class file it reads the strings as Unicode, so the string defined above, String s = "汉字";, is represented in memory as Unicode (UTF-16).

Calling String.getBytes() with no arguments already plants the seed of garbled text. This method uses the platform's default character set to obtain the byte array for the string. On the Chinese edition of Windows XP the default encoding is GBK. If you do not believe it, run:

    import java.nio.charset.Charset;

    public class Test {
        public static void main(String[] args) {
            // Version of the running JRE
            System.out.println("Current JRE: " + System.getProperty("java.version"));
            // Charset used by getBytes() and similar calls when none is specified
            System.out.println("Default character set for the current JVM: " + Charset.defaultCharset());
        }
    }

Output:

    Current JRE: 1.6.0_16
    Default character set for the current JVM: GBK

When data passes through several systems and databases and is re-encoded many times, garbled text is easy to produce if the principle is not understood. A system therefore needs one unified string encoding. Loosely speaking, this means unifying the encoding used at the boundary with the outside world. For example, for string parameters of methods and for I/O streams, a Chinese system can standardize on GBK, GB18030, UTF-8, UTF-16 and so on. The point is to choose a sufficiently large character set, so that every character that might be used can be displayed properly. (If everything were restricted to ASCII, for example, a round-trip conversion would not be possible.)

Pay special attention to the fact that these character sets do not all have the same coverage. UTF-8 and GB18030 can both represent every Unicode character, but GBK cannot, so in special cases converting UTF-8 text to GBK may produce garbled characters. Even so, many Chinese-language projects adopt UTF-8 without anyone being able to say why. Worse, when many people work on one system, some source files end up in GBK, some in UTF-8, and some in GB18030. For a Chinese project that is not outsourced, keeping all source code in GB18030 avoids the Ant build reporting unrecognizable characters. For a Chinese system, therefore, it is best to choose GBK or GB18030 (GBK is in fact a subset of GB18030) to minimize garbled text.

3. In-memory string encoding

Strings in memory are not limited to those loaded directly from class files. Some are read from text files, some are read from a database, and some are built from byte arrays, and these sources are generally not Unicode-encoded, for the simple reason that storage is optimized for size. So before processing data from such sources you must know the encoding of the "source" and read it into memory with that specified encoding.
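Reading external data with its declared source encoding can be sketched as follows. This is only an illustration: the class name SourceEncodingDemo, the helper readLine, and the in-memory byte array standing in for a file or database field are assumptions, not part of the original article. The Chinese character 中 is written as the escape \u4e2d so that the listing does not depend on the encoding of the source file itself.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceEncodingDemo {

    // Reads the first line of a byte stream, decoding it with the charset
    // that the caller declares for the source (hypothetical helper).
    static String readLine(InputStream in, Charset sourceCharset) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, sourceCharset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for external data that is known to be UTF-8 encoded.
        byte[] external = "ABCD\u4e2d".getBytes(StandardCharsets.UTF_8);

        // Decoding with the real source encoding restores the string.
        String right = readLine(new ByteArrayInputStream(external), StandardCharsets.UTF_8);
        // Decoding with the wrong encoding turns the 3 UTF-8 bytes into 3 wrong characters.
        String wrong = readLine(new ByteArrayInputStream(external), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("ABCD\u4e2d")); // true
        System.out.println(wrong.length());              // 7
    }
}
```

Passing the charset as a parameter keeps the decision about the source encoding at the boundary where the data enters the program, instead of relying on the platform default.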
If it is a parameter of a method, you must actually explicitly encode the string parameter, because this parameter may be passed by another Japanese system. When the string encoding is defined, the string can be handled correctly as required to avoid garbled characters. When decoding a string, you should call the following method: GetBytes (String charsetname)
String (byte[] bytes, String charsetname) instead of using a method signature without a character set name, the two methods above allow you to recode the characters in memory. Example description

 Public classMain { Public Static voidMain (string[] args) {System.out.println (System.getproperty ("File.encoding")); String String= "ABCD in"; byte[] b1=string.getbytes ();        System.out.println (b1.length); Try {            byte[] B2=string.getbytes ("Iso-8859-1"); System.out.println (NewString (B2, "Utf-8"));        System.out.println (b2.length); } Catch(unsupportedencodingexception e) {//TODO auto-generated Catch blockE.printstacktrace (); }    }}

Output

    UTF-8
    7
    ABCD?
    5

The default encoding configured in the development environment is UTF-8.

So getBytes() uses UTF-8 here. In memory the string is stored as Unicode (UTF-16), two bytes per character. Encoding it as UTF-8 converts each character to its UTF-8 byte sequence: the four letters take one byte each and the Chinese character takes three bytes, for a total of 7 bytes.
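The byte counts can be checked by printing the encoded bytes in hexadecimal. The sketch below is illustrative only: the class name ByteLengthDemo and the helper hex are assumptions, and the Chinese character is taken to be 中 (U+4E2D), written as an escape so the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

class ByteLengthDemo {

    // Formats bytes as space-separated uppercase hex (hypothetical helper).
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "ABCD\u4e2d";
        // Five UTF-16 code units in memory.
        System.out.println(s.length()); // 5
        // UTF-8: one byte per letter, three bytes for the Chinese character.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // 41 42 43 44 E4 B8 AD
        // UTF-16 (big-endian, no byte order mark): two bytes per character.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41 00 42 00 43 00 44 4E 2D
    }
}
```

Using the StandardCharsets constants instead of charset names also avoids the checked UnsupportedEncodingException.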

ISO-8859-1, in contrast, uses exactly one byte per character. For the English letters the low 8 bits of the character are taken directly, so each letter occupies one byte. The Chinese character does not fit in one byte and has no corresponding character in the ISO-8859-1 character set, so by default it is replaced with code 63, that is, '?'.

The byte array that ISO-8859-1 finally produces therefore has the same length as the string.
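The replacement with code 63 can be observed directly, and the loss can be detected before it happens. The following is a minimal sketch under the same assumption that the Chinese character is 中 (U+4E2D); the class name Latin1ReplacementDemo and the helper fitsLatin1 are hypothetical names introduced for illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1ReplacementDemo {

    // Returns true if every character of s has a mapping in ISO-8859-1
    // (hypothetical helper built on CharsetEncoder.canEncode).
    static boolean fitsLatin1(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(s);
    }

    public static void main(String[] args) {
        String s = "ABCD\u4e2d";
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(latin1.length);      // 5
        // The unmappable character was replaced by '?' (code 63).
        System.out.println(latin1[4]);          // 63
        // Checking first makes the data loss visible instead of silent.
        System.out.println(fitsLatin1(s));      // false
        System.out.println(fitsLatin1("ABCD")); // true
        // The original character cannot be recovered from the bytes.
        System.out.println(new String(latin1, StandardCharsets.UTF_8)); // ABCD?
    }
}
```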

