Performing string encoding conversion correctly in Java

Source: Internet
Author: User

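How is a string represented internally?

Java represents a string internally as Unicode text, specifically as a sequence of UTF-16 code units. Byte order (LE or BE) only comes into play when the string is encoded to bytes.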
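As a small illustration of the UTF-16 representation, the sketch below shows that `length()` counts UTF-16 code units, not bytes, and that byte order only matters once the string is encoded. The class name `Utf16Demo` and its constants are made up for this example; the Chinese text is written with `\u` escapes so the code does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Chinese for "hello"; each of the two characters fits in one UTF-16 code unit.
    static final String HELLO = "\u4f60\u597d";
    // One code point outside the Basic Multilingual Plane, stored as a surrogate pair.
    static final String EMOJI = "\uD83D\uDE00";

    public static void main(String[] args) {
        System.out.println(HELLO.length());                          // 2 code units
        System.out.println(EMOJI.length());                          // 2 code units
        System.out.println(EMOJI.codePointCount(0, EMOJI.length())); // 1 code point

        // Endianness is a property of the encoded bytes, not of the String itself.
        byte[] be = HELLO.getBytes(StandardCharsets.UTF_16BE);
        byte[] le = HELLO.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(be.length + " " + le.length);             // 4 4
    }
}
```

Both byte arrays describe the same string; only the order of the two bytes within each code unit differs.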

Take String s = "你好"; as an example. The literal is Chinese for "hello"; the effects described below only show up with non-ASCII text, because ASCII characters have the same bytes in GBK and UTF-8.

If the source file is GBK encoded and the operating system (Windows) uses GBK as its default encoding, the compiler decodes the bytes of the source file into characters according to GBK, and the string is then stored internally as Unicode.

When this string is printed, the JVM encodes the Unicode characters to GBK according to the locale of the operating system, and the console displays the GBK bytes.

When the source file is UTF-8, we need to tell the compiler the source encoding with javac -encoding UTF-8 ...; the compiler then decodes the string literal according to UTF-8. No matter which encoding the source file uses, the same string ends up as exactly the same Unicode data internally. On output it is still converted to GBK for display, because that depends on the OS environment.
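The point that one internal string yields different bytes depending on the output encoding can be checked with a short sketch. `PrintEncodingDemo` and `printedBytes` are names invented for this illustration, the output is captured in memory instead of being sent to a real console, and the example assumes the running JDK provides the GBK charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintEncodingDemo {
    // Prints s through a PrintStream that encodes with the named charset
    // and returns the bytes that reached the underlying stream.
    static byte[] printedBytes(String s, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(s);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u4f60\u597d"; // Chinese for "hello", written as escapes
        System.out.println(printedBytes(s, "GBK").length);   // 4: two bytes per character
        System.out.println(printedBytes(s, "UTF-8").length); // 6: three bytes per character
    }
}
```

The string is identical in both calls; only the encoder attached to the stream changes.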

How does garbled text arise? Essentially, the encoding used to decode the bytes when reading differs from the encoding that was used to produce them.

For example:

String s = "你好";

System.out.println(new String(s.getBytes(), "UTF-8")); // Wrong: getBytes() encodes with the platform default (GBK here), but the bytes are then decoded as UTF-8, so the output is garbled.

getBytes() encodes the Unicode string into a byte array using the platform default charset, which here gives the GBK bytes of the string. (On JDK 18 and later the default charset is UTF-8 unless configured otherwise.)

The charset in new String(bytes, charset) specifies how bytes is to be read; here it is UTF-8, so the content of bytes is interpreted as UTF-8.

Both of the following give correct results, because the encoding used to produce the bytes matches the encoding used to decode them (the first one assumes the platform default is GBK).

System.out.println(new String(s.getBytes(), "GBK"));

System.out.println(new String(s.getBytes("UTF-8"), "UTF-8"));
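The matched and mismatched cases above can be compared side by side in a minimal sketch. `MismatchDemo` and `roundTrip` are hypothetical names, the charsets are passed explicitly so the result does not depend on the platform default, and the example assumes the JDK ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchDemo {
    static final Charset GBK = Charset.forName("GBK"); // assumes GBK is available in this JDK

    // Encodes s with one charset and decodes the resulting bytes with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String s = "\u4f60\u597d"; // Chinese for "hello", written as escapes

        // Mismatch: GBK bytes read as UTF-8 do not give the original text back.
        System.out.println(s.equals(roundTrip(s, GBK, StandardCharsets.UTF_8))); // false

        // Match: decoding with the same charset that produced the bytes is lossless here.
        System.out.println(s.equals(roundTrip(s, GBK, GBK)));                    // true
        System.out.println(s.equals(roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8))); // true
    }
}
```

Using the `Charset` overloads instead of charset names also avoids the checked `UnsupportedEncodingException`.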

So how do we use getBytes and new String() to convert between encodings? Here is a wrong approach:

GBK to UTF-8: new String(s.getBytes("GBK"), "UTF-8"); This is completely wrong: getBytes produces GBK bytes, but they are then decoded as UTF-8, so the result is bound to be garbled.

But why does new String(s.getBytes("iso-8859-1"), "GBK") work in Tomcat? The answer is:

Tomcat uses ISO-8859-1 by default. If the client sent GBK bytes, Tomcat decodes those bytes as ISO-8859-1 while receiving the request, so the string we get is garbled.

To fix this we have to get from ISO-8859-1 back to GBK. ISO-8859-1 is a single-byte encoding: every byte value maps to exactly one character, and back again.

Decoding as ISO-8859-1 therefore loses nothing, and s.getBytes("iso-8859-1") returns exactly the original GBK bytes.

new String(s.getBytes("iso-8859-1"), "GBK") then decodes those bytes correctly. The trick works only because ISO-8859-1 happens to be lossless in this way; in that sense it is something of a coincidence, not a general conversion technique.
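The recovery trick can be reproduced without a server. The sketch below only simulates the situation: `Latin1RecoveryDemo`, `misdecode`, and `recover` are hypothetical names, and no Tomcat API is involved. It assumes the JDK provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1RecoveryDemo {
    static final Charset GBK = Charset.forName("GBK"); // assumes GBK is available in this JDK

    // Simulates a server that wrongly decodes incoming GBK bytes as ISO-8859-1.
    static String misdecode(byte[] gbkBytes) {
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes the wrong decoding: ISO-8859-1 maps each char back to its original byte,
    // and those bytes are then decoded with the charset they were really written in.
    static String recover(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, GBK);
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d"; // Chinese for "hello", written as escapes
        String garbled = misdecode(original.getBytes(GBK));

        System.out.println(garbled.length());                  // 4: one char per GBK byte
        System.out.println(original.equals(garbled));          // false
        System.out.println(original.equals(recover(garbled))); // true
    }
}
```

The same round trip through a charset that cannot represent every byte value would not be reversible, which is why this pattern should not be generalized.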

How do we correctly convert GBK to UTF-8? (This is really a conversion from Unicode to UTF-8.)

String gbkStr = "你好"; // The source file is GBK encoded, or the text was read from a GBK file; either way it is Unicode once it is a String.

// Use getBytes to encode the Unicode string into a UTF-8 byte array

byte[] utf8Bytes = gbkStr.getBytes("UTF-8");

// Then decode a new string from the byte array using UTF-8

String utf8Str = new String(utf8Bytes, "UTF-8");

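Simplified:

static String unicodeToUtf8(String s) throws UnsupportedEncodingException {

    // s.getBytes("UTF-8") is the UTF-8 encoded result; decoding it again yields a string equal to s

    return new String(s.getBytes("UTF-8"), "UTF-8");

}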

Converting to GBK works on the same principle:

return new String(s.getBytes("GBK"), "GBK");

In fact, the core work is done by getBytes(charset): it is the call that produces the bytes in the target encoding.

JDK description of getBytes: Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.
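When the starting point is raw GBK bytes rather than a String, the conversion is a decode followed by an encode. The following is a minimal sketch under that assumption; `TranscodeDemo` and `gbkToUtf8` are hypothetical names, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TranscodeDemo {
    static final Charset GBK = Charset.forName("GBK"); // assumes GBK is available in this JDK

    // Converts GBK encoded bytes to UTF-8 encoded bytes via an intermediate String.
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        String text = new String(gbkBytes, GBK);      // decode: bytes -> Unicode
        return text.getBytes(StandardCharsets.UTF_8); // encode: Unicode -> bytes
    }

    public static void main(String[] args) {
        String s = "\u4f60\u597d"; // Chinese for "hello", written as escapes
        byte[] gbkBytes = s.getBytes(GBK);
        byte[] utf8Bytes = gbkToUtf8(gbkBytes);

        System.out.println(gbkBytes.length);  // 4
        System.out.println(utf8Bytes.length); // 6
        System.out.println(Arrays.equals(utf8Bytes, s.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

The String in the middle carries no encoding of its own; the two charsets appear only at the byte boundaries.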

In addition, for reading and writing files:

OutputStreamWriter w1 = new OutputStreamWriter(new FileOutputStream("D:\\file1.txt"), "UTF-8");

new InputStreamReader(stream, charset)

These classes make it easy to read and write files in a specified encoding.
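A fuller sketch of the two classes in use might look as follows. It writes to a temporary file rather than a fixed path, and `FileEncodingDemo`, `write`, `read`, and `writeThenRead` are names invented for this example; the GBK charset is assumed to be available in the JDK.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileEncodingDemo {
    // Writes text to the file, encoding it with the given charset.
    static void write(Path file, String text, Charset cs) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(file.toFile()), cs)) {
            w.write(text);
        }
    }

    // Reads the whole file, decoding its bytes with the given charset.
    static String read(Path file, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new FileInputStream(file.toFile()), cs)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Writes text to a temporary file with one charset and reads it back with another.
    static String writeThenRead(String text, Charset writeWith, Charset readWith) {
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                write(tmp, text, writeWith);
                return read(tmp, readWith);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "\u4f60\u597d"; // Chinese for "hello", written as escapes
        Charset gbk = Charset.forName("GBK");
        System.out.println(s.equals(writeThenRead(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8))); // true
        System.out.println(s.equals(writeThenRead(s, StandardCharsets.UTF_8, gbk)));                     // false
    }
}
```

The second call shows the same rule as before: a file written as UTF-8 and read as GBK comes back garbled.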
