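Java and Unicode:
Java class files store string constants in (modified) UTF-8, and at run time the JVM represents char and String values as UTF-16 code units.
A Java String is therefore always Unicode text, whatever encoding the original data used.
In short, Java uses the Unicode character set, which makes internationalization easier.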
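As a rough illustration of the point above, the following sketch shows that a String is counted in UTF-16 code units and that bytes only appear once the string is encoded with a specific charset. The class name Utf16Demo and its helper methods are hypothetical, chosen only for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    // Number of UTF-16 code units (Java chars) in the string.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4e2d";                                  // U+4E2D, one code unit
        String emoji = new String(Character.toChars(0x1F600));  // U+1F600, a surrogate pair

        System.out.println(codeUnits(bmp));     // prints 1
        System.out.println(codeUnits(emoji));   // prints 2
        System.out.println(codePoints(emoji));  // prints 1

        // Bytes only appear once the string is encoded with a specific charset.
        System.out.println(bmp.getBytes(StandardCharsets.UTF_8).length);    // prints 3
        System.out.println(bmp.getBytes(StandardCharsets.UTF_16BE).length); // prints 2
    }
}
```

The literals are written as \u escapes so that the example does not depend on the encoding of the source file itself.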
Which character sets are supported by Java:
Which character sets can Java recognize and handle correctly?
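See the Charset class (java.nio.charset.Charset). The JDK that was current when this was written supported 160 character sets; the exact number varies with the JDK version and platform. The static method Charset.availableCharsets() returns all the character sets Java supports.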
Java code
- assertEquals(160, Charset.availableCharsets().size()); // the count depends on the JDK version and platform
- Set<String> charsetNames = Charset.availableCharsets().keySet();
- assertTrue(charsetNames.contains("UTF-8"));
- assertTrue(charsetNames.contains("UTF-16"));
- assertTrue(charsetNames.contains("GB2312"));
- assertTrue(Charset.isSupported("UTF-8"));
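For readers without a test framework at hand, the same check can be sketched as a small stand-alone program. The class name CharsetListing and the helper canonicalName are hypothetical; only Charset.availableCharsets(), Charset.forName() and Charset.name() come from the JDK.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

public class CharsetListing {
    // Resolves a charset name or alias (case-insensitive) to its canonical name.
    public static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        // Map from canonical charset name to Charset object.
        SortedMap<String, Charset> charsets = Charset.availableCharsets();
        System.out.println("Supported charsets: " + charsets.size());

        // Aliases and different letter case resolve to the same charset.
        System.out.println(canonicalName("utf8"));   // prints UTF-8
        System.out.println(canonicalName("utf-8"));  // prints UTF-8
    }
}
```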
When do you need to pay attention to encoding issues?
1. Reading data from an external resource:
This depends on how the external resource is encoded: to read the data correctly, we must decode it with the same character set that the resource was written in:
Java code
- InputStream is = new FileInputStream("res/input2.data");
- InputStreamReader streamReader = new InputStreamReader(is, "GB18030");
Here we read the external data using the GB18030 encoding, which can be verified by checking the encoding reported by streamReader:
Java code
- assertEquals("GB18030", streamReader.getEncoding());
It is precisely because we specified the correct encoding for the external resource that the data is decoded correctly when it is turned into a char array (GB18030 -> Unicode):
Java code
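- char[] chars = new char[is.available()];
- // read() may fill only part of the array; charsRead is the number of chars actually decoded
- int charsRead = streamReader.read(chars, 0, chars.length);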
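Putting the fragments above together, a minimal self-contained sketch might look like the following. It makes some assumptions: the class name ExplicitCharsetRead and the method readAll are hypothetical, a ByteArrayInputStream stands in for the FileInputStream on res/input2.data so that the example runs without any file, and the runtime is assumed to ship the GB18030 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ExplicitCharsetRead {
    // Reads the whole stream, decoding its bytes with the given charset.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            // read() may return fewer chars than requested, so loop until end of stream.
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        // Stand-in for the external resource: two Chinese characters encoded as GB18030.
        byte[] external = "\u4e2d\u6587".getBytes(gb18030);

        String decoded = readAll(new ByteArrayInputStream(external), gb18030);
        System.out.println(decoded.equals("\u4e2d\u6587")); // prints true
    }
}
```

Looping until read() returns -1 avoids relying on available(), which only estimates how many bytes can be read without blocking.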
But the code we often write looks like this:
Java code
- InputStream is = new FileInputStream("res/input2.data");
- InputStreamReader streamReader = new InputStreamReader(is);
Which encoding does InputStreamReader use here to read the external resource? Unicode? No. It uses the JVM's default character set, which is determined when the virtual machine starts, usually from the locale and charset of the underlying operating system. (Since JDK 18 the default is UTF-8 unless it is overridden with the file.encoding system property.) The JVM's default character set can be obtained as follows:
Java code
- Charset.defaultCharset();
Why does it work this way? Because the data comes from an external resource, and external resources are often encoded in the character set the operating system uses, so falling back to this default is understandable.
So I created a file from my IDE (IDEA) and read it using the JVM's default encoding, but the data came out garbled. Why? Because the file created by IDEA was encoded in UTF-8, which was not the JVM's default on that machine. To get a file in the JVM's default encoding, try creating a TXT file manually, outside the IDE.
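The garbling described above can be reproduced without touching the file system. In this sketch the UTF-8 bytes play the role of the file written by the IDE, and GBK is only an assumed example of a platform default on a Chinese-locale system; the class name DefaultCharsetMismatch and the helper decode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetMismatch {
    // Decodes raw bytes into a String using the given charset.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        String original = "\u4e2d\u6587";
        // What the IDE wrote to disk: the text encoded as UTF-8.
        byte[] fileBytes = original.getBytes(StandardCharsets.UTF_8);

        // Assumption: the platform default is GBK rather than UTF-8.
        Charset assumedDefault = Charset.forName("GBK");

        // Decoding with the wrong charset yields different (garbled) text.
        System.out.println(decode(fileBytes, assumedDefault).equals(original));         // prints false
        // Decoding with the charset the file was written in restores the text.
        System.out.println(decode(fileBytes, StandardCharsets.UTF_8).equals(original)); // prints true
    }
}
```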
2. Converting strings and byte arrays to each other
We usually convert a string to a byte array using the following code:
Java code
- "string".getBytes();
But have you ever noticed which encoding this conversion uses? In fact, the code above is equivalent to the following:
Java code
- "string".getBytes(Charset.defaultCharset());
That is, it converts the string into a byte array using the JVM's default encoding, not "Unicode" as you might assume.
Conversely, how do you create a string from a byte array?
Java code
- new String("string".getBytes());
Again, this constructor uses the platform's default character set to decode the given byte array (decoding here means converting from a character set to Unicode).
String encoding myth:
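To avoid depending on the platform default, both directions accept an explicit charset. The sketch below is one way to show this; the class name ExplicitStringBytes and its helpers are hypothetical, and the runtime is assumed to ship GB18030.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitStringBytes {
    // String -> bytes with an explicit charset (encoding).
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // bytes -> String with an explicit charset (decoding).
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        Charset gb18030 = Charset.forName("GB18030");

        // The same string produces different byte sequences in different charsets.
        System.out.println(encode(text, StandardCharsets.UTF_8).length); // prints 6
        System.out.println(encode(text, gb18030).length);                // prints 4

        // Decoding with the charset used for encoding restores the string.
        System.out.println(decode(encode(text, gb18030), gb18030).equals(text)); // prints true
    }
}
```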
Java code
- new String(input.getBytes("ISO-8859-1"), "GB18030")
What does the code above do? Some would say it "converts the input string from ISO-8859-1 encoding to GB18030 encoding." If that were true, how would we explain that Java strings, as just mentioned, are always Unicode?
That description is not merely imprecise, it is wrong. What actually happened is this: the data should have been read as GB18030 and decoded into a string, but it was decoded as ISO-8859-1 instead, which produced a wrong string. To recover, the string is turned back into the original byte array (by encoding it as ISO-8859-1) and then decoded again with the correct encoding, GB18030 (that is, the GB18030-encoded data is converted into a Unicode string). Note that the string itself is always Unicode.
But encoding conversion is not as simple as "two negatives make a positive". The conversion back works here only because ISO-8859-1 is a single-byte encoding in which every byte value maps to exactly one character, so each byte is carried into the String as is. The decoding was wrong, but no byte information was lost, so we still have a chance to convert back. With an encoding that cannot represent every byte sequence (UTF-8, for example), invalid bytes are replaced during decoding and the original data cannot be recovered.
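The difference between a recoverable and an unrecoverable wrong decoding can be sketched as follows. The class name MisdecodeRecovery and the method recover are hypothetical, US-ASCII is used only as one example of a charset that cannot represent every byte value, and the runtime is assumed to ship GB18030.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeRecovery {
    // Undoes a wrong decoding: re-encode with the wrong charset, then decode with the right one.
    public static String recover(String garbled, Charset wrong, Charset right) {
        return new String(garbled.getBytes(wrong), right);
    }

    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        String original = "\u4e2d\u6587";
        byte[] raw = original.getBytes(gb18030); // the external data

        // Wrongly decoded as ISO-8859-1: every byte becomes one char, so nothing is lost.
        String viaLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(recover(viaLatin1, StandardCharsets.ISO_8859_1, gb18030).equals(original)); // prints true

        // Wrongly decoded as US-ASCII: bytes above 0x7F are replaced, so the data is lost.
        String viaAscii = new String(raw, StandardCharsets.US_ASCII);
        System.out.println(recover(viaAscii, StandardCharsets.US_ASCII, gb18030).equals(original));    // prints false
    }
}
```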
Summary:
So, when dealing with Java encoding problems, we need to be clear about three concepts: the encoding Java itself uses (Unicode), the JVM platform's default character set, and the encoding of the external resource.
http://www.iteye.com/topic/311583
Analysis of Java Encoding (note three concepts) (repost)