Java and encoding issues, a walkthrough: how to understand Java's Unicode encoding

Last Update:2014-12-21 Source: Internet

Author: User

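Java developers must keep in mind that inside Java a character exists in only one form: Unicode. No external encoding such as GBK or UTF-8 is chosen for it; the character is identified by its number in the Unicode character set, which is the only way to treat all characters uniformly. Because Java represents text this way, a char occupies 2 bytes (16 bits). Strictly speaking, a char is one UTF-16 code unit: characters in the Basic Multilingual Plane, which covers almost all commonly used Chinese characters, fit in a single char, while characters beyond U+FFFF take two.

"In Java" here means inside the JVM, in memory: every variable of type char or String declared in the code.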

For example:

    System.out.println(System.getProperty("file.encoding"));
    char han = '永';
    // Format the second argument (a short) as hexadecimal, without a 0x prefix
    System.out.format("%x", (short) han);   // prints 6c38
    char han1 = 0x6c38;
    System.out.println(han1);               // prints 永

In other words, as long as the character 永 has been read in correctly, its in-memory representation must be 0x6c38; no other value can stand for this character.
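As a minimal sketch of the point above, the following shows that U+6C38 (the character from the example) is a single 16-bit char, while a character outside the Basic Multilingual Plane needs two chars (a surrogate pair). The class name CodeUnitDemo and its helper are made up for illustration, and the literals are written as Unicode escapes so the result does not depend on the encoding of the source file.

```java
final class CodeUnitDemo {
    // Hypothetical helper: hex value of the first code point in s.
    static String firstCodePointHex(String s) {
        return Integer.toHexString(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String yong = "\u6c38";         // U+6C38, inside the BMP
        String beyond = "\uD83D\uDE00"; // U+1F600, outside the BMP

        System.out.println(yong.length());                             // 1 char
        System.out.println(firstCodePointHex(yong));                   // 6c38
        System.out.println(beyond.length());                           // 2 chars (surrogate pair)
        System.out.println(beyond.codePointCount(0, beyond.length())); // 1 code point
        System.out.println(firstCodePointHex(beyond));                 // 1f600
    }
}
```

When text may contain characters beyond U+FFFF, counting with codePointCount rather than length gives the number of characters a reader would see.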

The JVM's compromise convention splits a character's life into two domains: inside the JVM, and the operating system's file system. Inside the JVM, characters are represented in Unicode. When a character moves from inside the JVM to the outside (for example, when it is saved as file content), it is converted to a specific encoding scheme. All encoding conversions therefore happen only at the boundary between the JVM and the OS, which is exactly where the various input/output streams (or the Reader and Writer classes) do their work.

All I/O classes fall into two main camps: character-oriented and byte-oriented input/output streams. "Oriented" refers to what these classes keep consistent when handling input and output.

Byte-oriented classes guarantee that the binary content of the file in the system and the binary content read into the JVM are identical; not a single 0 or 1 may be altered or reordered. This kind of I/O suits files such as video or audio, whose content needs no transformation.
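The byte-oriented behavior can be sketched as follows. The example copies bytes through in-memory streams and checks that nothing changed; ByteCopyDemo and its methods are illustrative names, and in-memory streams stand in for real files only to keep the sketch self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class ByteCopyDemo {
    // Copies every byte from in to out without interpreting it.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Hypothetical helper: round-trips data through byte streams in memory.
    static byte[] roundTrip(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            copy(in, out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Arbitrary binary content, including bytes that are not valid text.
        byte[] original = {(byte) 0xFF, (byte) 0xD8, 0x00, 0x7F, (byte) 0x80};
        System.out.println(Arrays.equals(original, roundTrip(original))); // true
    }
}
```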

Character-oriented I/O aims to keep the characters the same between the file in the system and what is read into memory. For example, suppose a Chinese-language Windows XP system holds a GBK text file that contains the character 永. We do not care what the GBK encoding of this character is; we only want character-oriented I/O to read it into memory and store it in a char variable. The I/O system must not simply put the GBK bytes of 永 into that char variable. Nor do we care what the exact binary content of the char variable is; we only expect that the character, once read in, is still 永.

In this sense, the character-oriented I/O classes, that is, the Reader and Writer classes, implicitly perform encoding conversion. On output, the in-memory Unicode characters are encoded with the system default encoding; on input, the already-encoded characters in the file system are decoded with the default encoding. Note that the convenience classes (such as FileReader and FileWriter, which before Java 11 offered no way to choose an encoding) only use this default encoding; you cannot tell them which encoding to use. This means that if a Chinese-language Windows XP system holds a UTF-8 encoded file, a reader of this kind will still decode it as GBK, and the result is of course garbled. This is essentially a foolproof convenience feature, which suits most beginners (and advanced users who do not need cross-platform behavior; Windows generally uses GBK, Linux UTF-8).
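The mismatch described above can be reproduced without any files. The sketch below encodes U+6C38 as UTF-8 and then decodes the bytes with a different charset, which is what a default-encoding reader effectively does on a GBK system. WrongCharsetDemo is an illustrative name, and the example assumes the runtime provides the GBK charset, as typical JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongCharsetDemo {
    // Encodes text as UTF-8, then decodes those bytes with the given charset.
    static String decodeUtf8BytesAs(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u6c38"; // U+6C38
        // Assumption: GBK is available in this runtime.
        Charset gbk = Charset.forName("GBK");

        String garbled = decodeUtf8BytesAs(original, gbk);
        String intact = decodeUtf8BytesAs(original, StandardCharsets.UTF_8);

        System.out.println(original.equals(garbled)); // false
        System.out.println(original.equals(intact));  // true
    }
}
```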

If you work with a file that is not in the default encoding (GBK in this example), you must control the conversion yourself, and conversion is always between characters and bytes. The Java I/O system therefore lets you specify the encoding at exactly the place where characters and bytes are converted: InputStreamReader and OutputStreamWriter. These two classes are the adapters between byte streams and character streams, and they carry out the encoding conversion.
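A minimal sketch of the two adapter classes with an explicitly chosen encoding follows. ExplicitCharsetDemo and its write and read helpers are illustrative names, and in-memory byte streams stand in for file streams; with real files one would wrap a FileOutputStream or FileInputStream in the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    // Encodes text to bytes through an OutputStreamWriter using the given charset.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decodes bytes back to text through an InputStreamReader using the given charset.
    static String read(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String original = "\u6c38"; // U+6C38
        byte[] utf8 = write(original, StandardCharsets.UTF_8);
        System.out.println(utf8.length);                                         // 3
        System.out.println(original.equals(read(utf8, StandardCharsets.UTF_8))); // true
    }
}
```

Passing the Charset explicitly on both sides makes the program behave the same on every platform, whatever the default encoding happens to be.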

Since Java represents characters in Unicode, the Chinese character 我 occupies 2 bytes in memory. The String.getBytes(encoding) method returns the byte-array representation of the string in the specified encoding: typically 2 bytes per Chinese character in GBK/GB2312 and 3 bytes in UTF-8. If no encoding is specified, the system default encoding is used.

For example

    String str = "我";   // a single Chinese character (assumed here to be 我, as in the text above)
    byte[] b;

    // UTF-8: one Chinese character takes three bytes
    b = str.getBytes("UTF-8");
    System.out.println("In UTF-8, one Chinese character takes " + b.length + " bytes");

    // UTF-16: reports 4 bytes, because the encoder prepends a 2-byte byte-order mark
    b = str.getBytes("UTF-16");
    System.out.println("In UTF-16, one Chinese character takes " + b.length + " bytes");

    // ANSI (the platform default, e.g. GBK): characters below 128 take one byte, others take two
    b = str.getBytes();
    System.out.println("In the default encoding " + System.getProperty("file.encoding")
            + ", one Chinese character takes " + b.length + " bytes");
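To see where the byte counts come from, it helps to print the bytes themselves. The sketch below dumps the encoded bytes of U+6211 (the character 我) in hex; EncodedBytesDemo and its hex helper are illustrative names. The UTF-16 result shows the byte-order mark (fe ff) that accounts for the two extra bytes, while UTF-16BE writes no mark.

```java
import java.nio.charset.StandardCharsets;

final class EncodedBytesDemo {
    // Hypothetical helper: renders a byte array as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "\u6211"; // U+6211
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_8)));    // e6 88 91
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_16)));   // fe ff 62 11
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_16BE))); // 62 11
    }
}
```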

Because the JDK is an international release, javac needs to know the encoding of the source file at compile time. If we do not specify it with the -encoding option, javac falls back to the default encoding reported by the file.encoding system property, which is derived from the operating system (for example, GBK on a Chinese-language Win2K). javac decodes the source file from that encoding into Java's internal Unicode form in memory, compiles the resulting Unicode text into a class file in memory, and then writes the class file to disk; this is the .class file we see. The .class file we end up with therefore stores its content, including the Chinese strings from our source program, in a Unicode-based form (string constants are stored as modified UTF-8), converted from the file.encoding format of the source.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
