Doing String Encoding Conversion Correctly in Java

Last Update:2018-07-29 Source: Internet

Author: User

The internal representation of a string
In Java, strings are always held in Unicode: a String is a sequence of UTF-16 code units (16-bit char values), whatever encoding the text originally came from.
Take String s = "你好哦!";
If the source file is GBK-encoded and the operating system (Windows here) also defaults to GBK, the compiler decodes the file's bytes as GBK into characters, and those characters are represented in the program as UTF-16 code units.
When the string is printed, the JVM encodes the characters with the platform's default charset (GBK in this environment), and the operating system then displays those GBK bytes.

When the source file is UTF-8, the compiler must be told the source encoding at compile time with javac -encoding UTF-8 .... The compiler then decodes the file as UTF-8 and again produces UTF-16 characters. So whatever encoding the source file uses, the same string literal yields exactly the same sequence of UTF-16 code units in memory, and when it is displayed it is still encoded to GBK (this depends on the OS environment).
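The point that the original encoding disappears once text becomes a String can be checked with a small sketch. The class and method names are illustrative, the text 你好 is written as byte values and Unicode escapes so the file compiles under any -encoding setting, and the sketch assumes the JDK in use provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class InternalFormDemo {
    // Assumption: the running JDK includes the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // The same two characters, U+4F60 U+597D, encoded in two different charsets.
    static final byte[] GBK_BYTES  = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
    static final byte[] UTF8_BYTES = {(byte) 0xE4, (byte) 0xBD, (byte) 0xA0,
                                      (byte) 0xE5, (byte) 0xA5, (byte) 0xBD};

    // Decode raw bytes into a String using the charset that produced them.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        String fromGbk = decode(GBK_BYTES, GBK);
        String fromUtf8 = decode(UTF8_BYTES, StandardCharsets.UTF_8);
        System.out.println(fromGbk.equals(fromUtf8)); // true: identical UTF-16 content
        System.out.println(fromGbk.length());         // 2 code units
    }
}
```

Different byte sequences go in, but the two String objects that come out are equal.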

How garbled text is produced
Garbled text appears when the encoding used to decode a sequence of bytes differs from the encoding that produced those bytes.
For example:
String s = "你好哦!";
System.out.println(new String(s.getBytes(), "UTF-8")); // Wrong: getBytes() encodes with the default charset (GBK here), but the bytes are then decoded as UTF-8.
Here getBytes() encodes the Unicode string into a byte array in the operating system's default charset, that is, the GBK bytes of "你好哦!".
The charset argument of new String(bytes, charset) specifies how the bytes are to be read; passing UTF-8 makes Java treat the content of bytes as UTF-8 data.
The following two forms give the correct result (on this GBK system), because the charset used to encode and the charset used to decode are the same:
System.out.println(new String(s.getBytes(), "GBK"));
System.out.println(new String(s.getBytes("UTF-8"), "UTF-8"));
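The mismatch can be reproduced without depending on the platform default by naming both charsets explicitly. This is a minimal sketch; roundTrip is a hypothetical helper, and GBK is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Assumption: the running JDK includes the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Encode with one charset, then decode the resulting bytes with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String s = "\u4f60\u597d"; // the text 你好
        // Matching charsets: the original string comes back.
        System.out.println(roundTrip(s, GBK, GBK).equals(s));
        System.out.println(roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(s));
        // Mismatched charsets: GBK bytes read as UTF-8 do not give the original back.
        System.out.println(roundTrip(s, GBK, StandardCharsets.UTF_8).equals(s));
    }
}
```

Passing charsets explicitly also removes the dependency on the machine's default, which is what makes the no-argument getBytes() behave differently from one system to another.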

So how should getBytes() and new String() be used to convert between encodings? One wrong approach is widely circulated on the Internet:
GBK --> UTF-8: new String(s.getBytes("GBK"), "UTF-8"). This is completely wrong: the bytes are produced as GBK but decoded as UTF-8, so the result is certainly garbled.
Then why does new String(s.getBytes("ISO-8859-1"), "GBK") work under Tomcat? The answer is:
Tomcat uses ISO-8859-1 by default, so when the original data is GBK-encoded, Tomcat decodes those GBK bytes as ISO-8859-1.
Chinese text read as ISO-8859-1 is certainly garbled, so the string has to be turned from ISO-8859-1 back into GBK. ISO-8859-1 is a single-byte encoding:
it treats every byte as one character, so decoding with it loses none of the original bytes; each byte simply becomes one char.
Because the GBK bytes pass through ISO-8859-1 unchanged, s.getBytes("ISO-8859-1") returns exactly the original GBK-encoded bytes,
and new String(s.getBytes("ISO-8859-1"), "GBK") decodes them correctly. It works only because of this property of ISO-8859-1, not because it is a general way to convert encodings.
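The ISO-8859-1 behaviour described above can be simulated without a servlet container. In this sketch, misdecodeAsLatin1 stands in for what the text says Tomcat does by default, and repair is the recovery step; both names are illustrative, and GBK is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTripDemo {
    // Assumption: the running JDK includes the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Stand-in for a container decoding incoming bytes as ISO-8859-1.
    static String misdecodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Recover the raw bytes via ISO-8859-1, then decode them with the real charset.
    static String repair(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d";          // the text 你好
        byte[] sent = original.getBytes(GBK);      // 4 GBK bytes
        String garbled = misdecodeAsLatin1(sent);  // 4 chars, one per byte
        // ISO-8859-1 maps each byte to exactly one char, so the bytes survive intact.
        System.out.println(Arrays.equals(garbled.getBytes(StandardCharsets.ISO_8859_1), sent));
        System.out.println(repair(garbled, GBK).equals(original));
    }
}
```

The same trick fails with a multi-byte intermediate charset such as UTF-8, because invalid byte sequences are replaced during decoding and the original bytes cannot be recovered.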

How do you correctly convert GBK to UTF-8? (In reality this is Unicode to UTF-8.)
String gbkStr = "你好哦!"; // From a GBK source file, or read from a GBK file; once it is a String it is already Unicode
Use getBytes to encode the Unicode string into a UTF-8 byte array:
byte[] utf8Bytes = gbkStr.getBytes("UTF-8");
Then decode the byte array into a new string as UTF-8:
String utf8Str = new String(utf8Bytes, "UTF-8");
Simplified, this is:
static String unicodeToUtf8(String s) throws UnsupportedEncodingException {
return new String(s.getBytes("UTF-8"), "UTF-8");
}
Converting to GBK follows the same principle:
return new String(s.getBytes("GBK"), "GBK");
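When the data at hand is bytes rather than a String (for example, the content of a GBK file that must become UTF-8), the conversion is decode-then-encode. The following is a sketch with a hypothetical transcode helper, again assuming the JDK provides GBK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Transcode {
    // Assumption: the running JDK includes the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Decode the input with its real charset, then encode the String with the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3}; // 你好 in GBK
        byte[] utf8Bytes = transcode(gbkBytes, GBK, StandardCharsets.UTF_8);
        // Same result as encoding the text directly as UTF-8.
        System.out.println(Arrays.equals(utf8Bytes, "\u4f60\u597d".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The String in the middle is only the Unicode intermediate form; the charset passed to the decoding step must be the one the bytes were actually written in.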

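In fact, the real work is done by getBytes(charset): the byte array it returns is the encoded data, while the String decoded back from it is equal to the original.
The JDK describes getBytes as follows: Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.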
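To make the effect of getBytes(charset) visible, the sketch below prints the bytes of one string under several charsets. The hex helper is illustrative and not part of the JDK; GBK is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedBytesDemo {
    // Assumption: the running JDK includes the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Encode the string with the given charset and format the bytes as hex pairs.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4f60\u597d"; // the text 你好
        System.out.println(hex(s, GBK));                         // C4 E3 BA C3
        System.out.println(hex(s, StandardCharsets.UTF_8));      // E4 BD A0 E5 A5 BD
        System.out.println(hex(s, StandardCharsets.UTF_16BE));   // 4F 60 59 7D
    }
}
```

One String, three different byte arrays: the encoding exists only at the byte level.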

In addition, for reading and writing files,
OutputStreamWriter w1 = new OutputStreamWriter(new FileOutputStream("D:\\file1.txt"), "UTF-8");
InputStreamReader(stream, charset)
make it easy to read and write files in a specified encoding.
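A fuller sketch of charset-aware file I/O follows. It writes to a temporary file rather than a fixed path, and the class and helper names are assumptions made for illustration; only OutputStreamWriter and InputStreamReader come from the text above.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class CharsetFileIo {
    // Assumption: the running JDK includes the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Encode the text with the given charset while writing it to the file.
    static void write(Path file, String text, Charset cs) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(file.toFile()), cs)) {
            w.write(text);
        }
    }

    // Decode the file's bytes with the given charset while reading.
    static String read(Path file, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new FileInputStream(file.toFile()), cs)) {
            char[] buf = new char[1024];
            for (int n; (n = r.read(buf)) != -1; ) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Write to a temporary file with one charset, read it back with another, then clean up.
    static String roundTrip(String text, Charset writeCs, Charset readCs) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                write(tmp, text, writeCs);
                return read(tmp, readCs);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4f60\u597d"; // the text 你好
        System.out.println(roundTrip(text, GBK, GBK).equals(text));
    }
}
```

As with getBytes and new String, the reader must be given the same charset the writer used, or the text comes back garbled.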

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
