"Character encoding" Java character encoding detailed solution and problem discussion

Source: Internet
Author: User

First, preface

Having finished the material on byte encoding, we now analyze character encoding in Java. This topic also leads to a more interesting question that the author has not yet been able to answer, and pointers from friends of the Garden are welcome.

Second, Java character encoding

It is easier to get a feel for this by going straight to the code.

(Code listing and its output: collapsed "View Code" blocks on the original page, not preserved.)

Description: The results tell us the following.

1. In Java, a Chinese character encoded as ASCII comes out as 3F, which is the symbol '?'. Encoded as ISO-8859-1 it also comes out as 3F, again '?'. This means Chinese characters lie outside the range that ASCII and ISO-8859-1 can represent.

2. UTF-16 uses big-endian storage by default and prepends the byte order mark FE FF to the byte array. The FE FF is counted in the byte array length.

3. When big-endian (UTF-16BE) or little-endian (UTF-16LE) is specified explicitly, no FE FF or FF FE mark is written, and the byte array length does not include it.

4. The charset named Unicode gives the same representation as UTF-16.

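5. The no-argument getBytes() method uses the default charset, which was UTF-8 in this test. (Before JDK 18 the default depends on the platform; from JDK 18 on it is UTF-8.)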
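The five observations above can be checked with a short program. This is a minimal sketch, not the original listing: the class name EncodingDemo, the helper hex and the sample character (U+4E2D, written as an escape so that the encoding of the source file itself does not matter) are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Format a byte array as space-separated upper-case hex, e.g. "FE FF 4E 2D".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // one Chinese character, U+4E2D

        // Unmappable characters are replaced by '?' (3F).
        System.out.println("US-ASCII   : " + hex(s.getBytes(StandardCharsets.US_ASCII)));   // 3F
        System.out.println("ISO-8859-1 : " + hex(s.getBytes(StandardCharsets.ISO_8859_1))); // 3F

        // Plain UTF-16 writes a big-endian byte order mark first.
        System.out.println("UTF-16     : " + hex(s.getBytes(StandardCharsets.UTF_16)));     // FE FF 4E 2D
        System.out.println("UTF-16BE   : " + hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 4E 2D
        System.out.println("UTF-16LE   : " + hex(s.getBytes(StandardCharsets.UTF_16LE)));   // 2D 4E
        System.out.println("UTF-8      : " + hex(s.getBytes(StandardCharsets.UTF_8)));      // E4 B8 AD

        // "Unicode" is looked up by name; guarded in case a JVM does not know the alias.
        if (Charset.isSupported("Unicode")) {
            System.out.println("Unicode    : " + hex(s.getBytes(Charset.forName("Unicode"))));
        }

        // No-argument getBytes() uses whatever the default charset is on this JVM.
        System.out.println("default (" + Charset.defaultCharset() + "): " + hex(s.getBytes()));
    }
}
```

The last line is the one to watch: its output changes with the JVM's default charset, which is why it should not be relied on to produce UTF-8 everywhere.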

Third, the char representation problem

We know that in Java the char type is two bytes long. Consider the following example.

    public class Test {
        public static void main(String[] args) throws Exception {
            char ch1 = 'a';        // 1
            char ch2 = '李';       // 2
            char ch3 = '\uffff';   // 3
            char ch4 = '\u10000';  // 4
        }
    }

Question: Do you think this code compiles? If not, why not, and which line is in error?

Analysis: Copy the example into Eclipse and go to the error. The fourth line is flagged with the message "Invalid character constant".

Answer: The key is that char is two bytes long and Java characters are encoded in UTF-16. A \u escape takes exactly four hexadecimal digits, so '\u10000' is read as U+1000 followed by '0', which is two characters and therefore not a valid char constant. More fundamentally, the code point U+10000 is beyond what two bytes can hold, so no single char can represent it. A char covers only plane 0 of the Unicode table (the BMP), 0000-FFFF (hex); code points on the supplementary planes, 010000-10FFFF (hex), must be represented by four bytes, that is, by two chars.
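How a supplementary-plane code point is split into two chars can be sketched as follows. The class name SurrogateDemo and the helper toSurrogates are hypothetical names for illustration; Character.toChars and Character.charCount are the standard library calls used to cross-check the manual arithmetic.

```java
import java.util.Arrays;

public class SurrogateDemo {
    // Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair by hand.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;                 // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));     // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));    // low 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x10000;
        char[] manual = toSurrogates(cp);
        char[] library = Character.toChars(cp);

        System.out.printf("U+%X -> %04X %04X%n", cp, (int) manual[0], (int) manual[1]); // U+10000 -> D800 DC00
        System.out.println(Arrays.equals(manual, library)); // true

        // Number of chars needed for a code point: 1 inside the BMP, 2 outside it.
        System.out.println(Character.charCount(0xFFFF));  // 1
        System.out.println(Character.charCount(0x10000)); // 2
    }
}
```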

With this understanding, let's look at the following code

    public class Test {
        public static void main(String[] args) throws Exception {
            char ch1 = 'a';
            char ch2 = '李';
            char ch3 = '\uffff';
            String str = "\u10000";
            System.out.println(String.valueOf(ch1).length());
            System.out.println(String.valueOf(ch2).length());
            System.out.println(String.valueOf(ch3).length());
            System.out.println(str.length());
        }
    }

Output:

    1
    1
    1
    2

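Note: From the results we can see that every code point on the BMP (including 'a', '李' and '\uffff') has length 1. The string "\u10000" has length 2, but strictly speaking that is because the compiler reads it as U+1000 followed by '0'. A genuine supplementary-plane code point, written as a surrogate pair such as "\ud800\udc00" (U+10000), also has length 2. Note the difference between the length() method of a String, which counts chars, and the length field of a byte array, which counts bytes.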
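The difference between counting chars, code points and bytes can be made concrete with a short sketch. The class name LengthDemo and the variable names are assumptions for illustration; only standard String and Character methods are used.

```java
import java.nio.charset.StandardCharsets;

public class LengthDemo {
    public static void main(String[] args) {
        // A \\u escape takes four hex digits, so this literal is U+1000 followed by '0'.
        String twoBmpChars = "\u10000";
        // A real supplementary code point, built from its code point value.
        String supplementary = new String(Character.toChars(0x10000));

        System.out.println(twoBmpChars.length());                                    // 2
        System.out.println(twoBmpChars.codePointCount(0, twoBmpChars.length()));     // 2
        System.out.println((int) twoBmpChars.charAt(1));                             // 48, the character '0'

        System.out.println(supplementary.length());                                  // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1

        // Byte array length depends on the encoding, not on String.length().
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length);    // 4
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_16BE).length); // 4
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_16).length);   // 6, includes FE FF
    }
}
```

When the number of user-visible characters matters, codePointCount is usually the closer measure; length() only counts UTF-16 code units.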

Fourth, discovery of the problem

When writing small Java programs, I usually do not open Eclipse. I write the code directly in Notepad++, then compile and run it with the javac and java commands and look at the result. This habit is how I ran into the following question. Let me describe it step by step, and I would welcome pointers from friends of the Garden.

Here is a simple program (ignore the meaning of the string).

    public class Test {
        public static void main(String[] args) throws Exception {
            String str = "我我我我我我我\ud843\udc30";
            System.out.println(str.length());
        }
    }

Description: The program is very simple: it prints the length of the string.

4.1 Two methods of compiling

1. Compile with javac Test.java, which succeeds. Then run the program with java Test. The result is:

    17

Description: From this result we can speculate that each character '我' has length 1 and that \ud843\udc30 has length 10, with each \u counting as length 1.

2. Compile with javac -encoding utf-8 Test.java, which also succeeds. Then run the program with java Test. The result is:

    9

Explanation: This result is easy to understand. Each '我' is a BMP character of length 1, and \ud843 and \udc30 are each one char (together they form a surrogate pair), so the total is 7 + 2 = 9.

The two ways of compiling give different results. After looking it up, I learned that on my system javac Test.java uses GBK by default, which is the same as compiling with javac -encoding GBK Test.java.
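The length reported for the correctly decoded source, and the default charset that javac falls back to without -encoding, can both be inspected with a small sketch. The class name DefaultCharsetDemo and the helper sample are hypothetical; the string is the article's example written with escapes so that this listing does not depend on its own file encoding.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    // The example string: seven copies of U+6211 followed by the surrogate pair D843 DC30.
    static String sample() {
        return "\u6211\u6211\u6211\u6211\u6211\u6211\u6211\ud843\udc30";
    }

    public static void main(String[] args) {
        // What this JVM considers the default; the value differs by OS, locale and JDK version.
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));

        String str = sample();
        System.out.println(str.length());                            // 9 chars
        System.out.println(str.codePointCount(0, str.length()));     // 8 code points
        // The two surrogates combine into one supplementary code point.
        System.out.println(Integer.toHexString(str.codePointAt(7))); // 20c30
    }
}
```

Passing -encoding explicitly to javac removes the dependence on the first two printed values.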

4.2. View the class file

1. View the class file produced by javac Test.java, opened with WinHex, with the following result:

Description: The red mark in the figure shows the approximate position of the string "我我我我我我我\ud843\udc30". As analyzed earlier, class files store strings in UTF-8, so first decode E9 8E B4: the Unicode code point is 93B4 (hex), and the Unicode table shows that this is the character '鎴', which has nothing to do with '我'. Moreover, the bytes E6 9E that follow E9 8E B4 are not equal to E9 8E B4, although logically the same character should always have the same encoding. Looking further, the marked region does show a pattern: E9 8E B4 E6 9E E5 9E 9C (nine bytes) stands for '我我' and is repeated 3 times, covering six '我', and the following E9 8E B4 E6 85 (five bytes) stands for the last '我', for 7 '我' in total. This clearly raises doubts again.

My guess is that javac Test.java reads the source as GBK while the class file is stored as UTF-8. There must be some conversion between the two steps, and the final class file contains the result of that conversion.

2. View the class file produced by javac -encoding utf-8 Test.java, opened with WinHex, with the following result:

Description: The red mark shows the approximate position of the string. The bytes E6 88 91 do, after calculation, correspond to the character '我'. Nothing is puzzling here.
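The calculation from UTF-8 bytes to a code point mentioned in the two descriptions above can be sketched as follows. The class name Utf8Decode and the helper decode3 are hypothetical; the sketch handles only three-byte sequences, which is all these examples need.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Decode {
    // Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into its code point.
    static int decode3(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        System.out.printf("E6 88 91 -> U+%04X%n", decode3(0xE6, 0x88, 0x91)); // U+6211
        System.out.printf("E9 8E B4 -> U+%04X%n", decode3(0xE9, 0x8E, 0xB4)); // U+93B4

        // Cross-check the manual arithmetic against the JDK's UTF-8 decoder.
        byte[] bytes = { (byte) 0xE6, (byte) 0x88, (byte) 0x91 };
        String viaJdk = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(viaJdk.codePointAt(0) == decode3(0xE6, 0x88, 0x91)); // true
    }
}
```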

4.3 Investigating the doubt

1. Change the value of the string, using the following code:

public class Test {public        static void Main (string[] args) throws Exception {        String str = "I am coder";        System.out.println (Str.length ());}    }

As before, use the javac Test.java and java Test commands. The result is:

    8

This is even more puzzling. Why 8?

2. What the reference material says

If javac is run without the -encoding parameter, which specifies the encoding of the Java source file, javac.exe falls back to the operating system's default encoding. That is, when compiling a Java program without a specified source encoding, the JDK first reads the file.encoding property (which reflects the operating system's default encoding, for example GBK on Win2K), and then reads the Java source into memory, converting it from the file.encoding encoding to Java's internal UTF-16 format. The class file is then written out. As we know, class files are encoded in UTF-8, and they contain the Chinese strings from the source program, except that these have now been converted from the file.encoding encoding to UTF-8.
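The conversion described above can be simulated to see what lengths it would produce. This is a sketch under an assumption the article does not confirm: that Test.java was saved as UTF-8 and then read by javac as GBK. The class and method names are hypothetical, the escape handling is simplified to plain \uXXXX sequences, and the JVM must provide the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MisreadSourceDemo {
    // Matches a backslash, 'u' and four hex digits as they appear in source text.
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Simplified version of the compiler's escape step: each \\uXXXX becomes one char.
    static String decodeEscapes(String text) {
        Matcher m = UNICODE_ESCAPE.matcher(text);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(text, last, m.start());
            sb.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        sb.append(text, last, text.length());
        return sb.toString();
    }

    // Length of a string literal when the file is saved in one charset and read in another.
    static int lengthSeenByCompiler(String literalAsTyped, Charset savedAs, Charset readAs) {
        byte[] onDisk = literalAsTyped.getBytes(savedAs); // bytes the editor wrote
        String asRead = new String(onDisk, readAs);       // text the compiler sees
        return decodeEscapes(asRead).length();
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        Charset gbk = Charset.forName("GBK"); // assumed to be available in this JVM

        // Literal text as typed in the editor; "\\u" keeps the escapes as six plain characters.
        String first = "\u6211\u6211\u6211\u6211\u6211\u6211\u6211\\ud843\\udc30";
        String second = "\u6211\u662fcoder";

        System.out.println(lengthSeenByCompiler(first, utf8, utf8));  // 9
        System.out.println(lengthSeenByCompiler(first, utf8, gbk));   // 17
        System.out.println(lengthSeenByCompiler(second, utf8, utf8)); // 7
        System.out.println(lengthSeenByCompiler(second, utf8, gbk));  // 8
    }
}
```

Under that assumption the simulation yields the same 17 and 8 that the article observed: GBK pairs up the UTF-8 bytes two at a time, and in the first string the last unpaired byte combines with the backslash of \ud843, so that escape is no longer recognized. This is offered as a plausible explanation, not as a confirmed answer to the author's question.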

Fifth, questions raised

1. After compiling with javac Test.java, why does the class file have the content described above (that is, how exactly are GBK, UTF-16 and UTF-8 converted into one another)?

2. After compiling with javac Test.java, why is one result 17 and the other 8?

Sixth, summary

The exploration was very interesting. The problem has not been solved yet; I will post the answer once I have it, and readers with ideas are welcome to discuss. Thank you for reading, friends of the Garden ~

Reference Links:

http://blog.csdn.net/xiunai78/article/details/8349129

