Doing string encoding conversion correctly in Java: character encoding formats
Source: Internet
Author: User
The internal representation of strings
In Java, strings are uniformly represented in Unicode: a String is a sequence of UTF-16 code units (char values), regardless of the encoding the text originally came from.
Take, for example, String s = "你好哦!"; (Chinese for "Hello Oh!").
If the source file is GBK encoded and the default encoding of the operating system (Windows) is also GBK, the compiler decodes the bytes of the source file as GBK and converts the characters to Unicode, which is the form the JVM stores internally.
When this string is printed, the JVM converts the Unicode characters to GBK according to the locale of the operating system, and the operating system then displays the GBK-encoded content.
When the source file is UTF-8, we need to tell the compiler the source encoding at compile time: javac -encoding utf-8 .... The compiler then decodes the source as UTF-8 and converts the characters to Unicode. So no matter what encoding the source file uses, the same string yields exactly the same Unicode representation, and when displayed it is still converted to GBK (depending on the OS environment).
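The point above can be checked with a small sketch. It assumes the runtime supports the GBK charset (standard JDK builds do), and the class and helper names are made up for illustration. The string is written with Unicode escapes so that the example does not depend on the encoding of its own source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original article.
class InternalRepresentationDemo {
    // The two Chinese characters U+4F60 U+597D, written as escapes so the
    // source file encoding cannot affect them.
    static final String S = "\u4f60\u597d";

    // Lists the UTF-16 code units the String holds internally.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("%04x ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // One internal representation ...
        System.out.println(codeUnits(S)); // 4f60 597d
        // ... but different byte sequences depending on the target encoding.
        System.out.println(S.getBytes(Charset.forName("GBK")).length);   // 4
        System.out.println(S.getBytes(StandardCharsets.UTF_8).length);   // 6
    }
}
```

The String itself has no "GBK" or "UTF-8" property; an encoding only comes into play when the characters are turned into bytes or bytes are turned into characters.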
How garbled text is produced
In essence, garbled text appears when the encoding used to produce the bytes of a string differs from the encoding used to decode them when they are read.
For example:
String s = "你好哦!"; // Chinese for "Hello Oh!"
System.out.println(new String(s.getBytes(), "UTF-8")); // Wrong: getBytes() uses the platform default encoding (GBK here), but the bytes are then decoded as UTF-8, so the result is garbled.
Here getBytes() converts the Unicode string into a byte array in the operating system's default encoding, that is, the GBK bytes of "你好哦!". (On JDK 18 and later the default charset is UTF-8 unless configured otherwise, so this description applies to older JDKs on a GBK-locale Windows system.)
The charset argument of new String(bytes, charset) specifies how the bytes are to be read; specifying UTF-8 here treats the contents of bytes as UTF-8 data.
The following two forms give the correct result, because the encoding used to produce the bytes and the encoding used to decode them are the same.
System.out.println(new String(s.getBytes(), "GBK"));
System.out.println(new String(s.getBytes("UTF-8"), "UTF-8"));
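A minimal sketch of the matched and mismatched cases follows. To make the behaviour independent of the platform default, it names both charsets explicitly; the class name and the roundTrip helper are hypothetical, and GBK support in the runtime is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original article.
class MojibakeDemo {
    static final Charset GBK = Charset.forName("GBK");

    // Encodes with one charset and decodes with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String s = "\u4f60\u597d"; // two Chinese characters, written as escapes

        // Same charset on both sides: the text survives.
        System.out.println(roundTrip(s, GBK, GBK).equals(s)); // true
        System.out.println(roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(s)); // true

        // GBK bytes decoded as UTF-8: malformed sequences become U+FFFD.
        String garbled = roundTrip(s, GBK, StandardCharsets.UTF_8);
        System.out.println(garbled.equals(s));            // false
        System.out.println(garbled.indexOf('\uFFFD') >= 0); // true
    }
}
```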
So how should getBytes() and new String() be used to convert between encodings? A wrong approach is widely circulated on the Internet:
GBK --> UTF-8: new String(s.getBytes("GBK"), "UTF-8"); This is completely wrong: the bytes are produced in GBK but decoded as UTF-8, so the result is certain to be garbled.
But why does new String(s.getBytes("iso-8859-1"), "GBK") work under Tomcat? The answer is:
Tomcat uses ISO-8859-1 by default, which means that if the original bytes are GBK, Tomcat decodes those GBK bytes as ISO-8859-1.
Reading Chinese as ISO-8859-1 naturally gives garbled text, so we need to turn the ISO-8859-1 string back into GBK. ISO-8859-1 is a single-byte encoding:
it treats every byte as one character, so decoding with it loses none of the original bytes; each byte simply becomes one char with the same value.
Because the content is unchanged by the ISO-8859-1 decoding, s.getBytes("iso-8859-1") returns exactly the original GBK bytes,
and new String(s.getBytes("iso-8859-1"), "GBK") decodes them correctly. So this works only by coincidence: ISO-8859-1 happens to be a lossless, byte-for-byte mapping.
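The recovery trick can be reproduced without Tomcat. In the sketch below the "container" step is simulated by decoding GBK bytes as ISO-8859-1; the class and method names are hypothetical, and GBK support in the runtime is assumed. For contrast, it also shows that the same repair does not work when the wrong decoder was UTF-8, because malformed bytes have already been replaced.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the original article.
class Latin1RecoveryDemo {
    static final Charset GBK = Charset.forName("GBK");

    // Undoes a wrong ISO-8859-1 decode by recovering the raw bytes
    // and decoding them with the charset they were really written in.
    static String recover(String misdecoded, Charset actual) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d";       // two Chinese characters
        byte[] wire = original.getBytes(GBK);   // bytes as sent by the client

        // Simulates a container that decodes everything as ISO-8859-1.
        String misdecoded = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(misdecoded.length());                     // 4 (one char per byte)
        System.out.println(recover(misdecoded, GBK).equals(original)); // true

        // A wrong UTF-8 decode is lossy: the original bytes cannot be recovered.
        byte[] back = new String(wire, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(back, wire));               // false
    }
}
```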
How do we correctly convert GBK to UTF-8? (It is actually Unicode to UTF-8.)
String gbkStr = "你好哦!"; // The source file is GBK, or the string was read from a GBK file; once it is a String it is Unicode.
Use getBytes to encode the Unicode string into a UTF-8 byte array:
byte[] utf8Bytes = gbkStr.getBytes("UTF-8");
Then decode the byte array into a new string as UTF-8:
String utf8Str = new String(utf8Bytes, "UTF-8");
Simplified, this is:
String unicodeToUtf8(String s) throws UnsupportedEncodingException {
    return new String(s.getBytes("UTF-8"), "UTF-8");
}
UTF-8 to GBK works on the same principle:
return new String(s.getBytes("GBK"), "GBK");
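In fact, the core work is done by getBytes(charset): the byte array it returns is the data in the target encoding, and the String decoded back from it holds the same Unicode characters as before.
The JDK description of getBytes: Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.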
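Since the converted data is the byte array, a conversion between encodings can be sketched as "decode with the source charset, then getBytes with the target charset". The transcode and hex helpers and the class name below are hypothetical, and GBK support in the runtime is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original article.
class TranscodeDemo {
    // Decodes the input with its real charset, then re-encodes it.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // bytes -> Unicode
        return text.getBytes(to);              // Unicode -> bytes
    }

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // GBK bytes of the two characters U+4F60 U+597D.
        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        byte[] utf8Bytes = transcode(gbkBytes, Charset.forName("GBK"), StandardCharsets.UTF_8);
        System.out.println(hex(utf8Bytes)); // e4 bd a0 e5 a5 bd
    }
}
```

Using the Charset overloads of getBytes and new String, as here, also avoids the checked UnsupportedEncodingException thrown by the overloads that take a charset name.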
In addition, for reading and writing files,
OutputStreamWriter w1 = new OutputStreamWriter(new FileOutputStream("D:\\file1.txt"), "UTF-8");
new InputStreamReader(stream, charset)
make it easy to read and write files in a specified encoding.
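As an illustration, the sketch below writes a GBK file, reads it back with the matching charset, and writes the same text out as UTF-8. It uses temporary files rather than the D:\\file1.txt path above; the class and helper names are hypothetical, and GBK support in the runtime is assumed.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of the original article.
class FileEncodingDemo {
    // Writes text to a file, encoding it with the given charset.
    static void write(Path file, String text, Charset cs) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(file.toFile()), cs)) {
            w.write(text);
        }
    }

    // Reads a whole file, decoding it with the given charset.
    static String read(Path file, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new FileInputStream(file.toFile()), cs)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Charset gbk = Charset.forName("GBK");
        String text = "\u4f60\u597d"; // two Chinese characters
        Path gbkFile = Files.createTempFile("demo-gbk", ".txt");
        Path utf8File = Files.createTempFile("demo-utf8", ".txt");

        write(gbkFile, text, gbk);
        // Convert the file: decode as GBK, encode as UTF-8.
        write(utf8File, read(gbkFile, gbk), StandardCharsets.UTF_8);

        System.out.println(Files.size(gbkFile));   // 4
        System.out.println(Files.size(utf8File));  // 6
        System.out.println(read(utf8File, StandardCharsets.UTF_8).equals(text)); // true

        Files.delete(gbkFile);
        Files.delete(utf8File);
    }
}
```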