About Unicode and encoding in Java

Source: Internet
Author: User
My understanding of Unicode and encoding in Java

We often run into encoding problems. Java is often described as an international language because its class files store string constants in (a modified form of) UTF-8, while the JVM uses UTF-16 at run time. (I have not read up on why the JVM uses UTF-16, but my guess is that it is because a char in Java is 16 bits wide and UTF-16 is built from 16-bit code units; a character outside the Basic Multilingual Plane takes two chars.) Both are Unicode encodings.

The goal of Unicode is to support all character sets in the world; that is, almost every character in any existing character set has a corresponding code in Unicode. The mapping between characters and codes is the Universal Character Set (UCS), and the code assigned to each character is called a code point. UTF-8 and UTF-16 are two different ways of encoding UCS code points as bytes; UTF stands for UCS Transformation Format.
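To make the difference between a code point and its encoded forms concrete, here is a minimal sketch. The class name CodePointDemo, the helper names, and the three sample characters are my own choices for illustration, not something from the original text.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16.
    // UTF_16BE is used because StandardCharsets.UTF_16 would prepend
    // a 2-byte byte-order mark when encoding.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String latin = "A";                                     // U+0041
        String cjk = "\u4e2d";                                  // U+4E2D
        String emoji = new String(Character.toChars(0x1F600));  // U+1F600, outside the BMP

        for (String s : new String[] {latin, cjk, emoji}) {
            System.out.printf("U+%04X: chars=%d, UTF-8 bytes=%d, UTF-16 bytes=%d%n",
                    s.codePointAt(0), s.length(), utf8Length(s), utf16Length(s));
        }
    }
}
```

The same code point takes a different number of bytes in each encoding form, and a code point above U+FFFF needs two Java chars (a surrogate pair), which is why String.length() counts UTF-16 code units rather than characters.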

In Java, the getBytes() method of String encodes a string (Unicode) into bytes according to the given character set, and new String() decodes bytes back into Unicode according to a character set. Every String in Java is Unicode.
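The encode and decode directions can be sketched as follows. The class name EncodeDecodeDemo, the helper names, and the two-character sample text are assumptions of mine; the sketch also assumes the running JDK ships the GBK charset, which standard JDK builds normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Assumption: GBK is available in this JDK; Charset.forName throws otherwise.
    static final Charset GBK = Charset.forName("GBK");

    // encode: Unicode String -> byte[] in the given charset
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // decode: byte[] in the given charset -> Unicode String
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";  // two CJK characters, U+4E2D U+6587

        byte[] gbkBytes = encode(text, GBK);                      // D6 D0 CE C4
        byte[] utf8Bytes = encode(text, StandardCharsets.UTF_8);  // E4 B8 AD E6 96 87

        System.out.println(gbkBytes.length);   // 4
        System.out.println(utf8Bytes.length);  // 6

        // Decoding with the charset that produced the bytes restores the String.
        System.out.println(decode(gbkBytes, GBK).equals(text));                      // true
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8).equals(text));  // true
    }
}
```

Passing a Charset object rather than a charset name avoids the checked UnsupportedEncodingException that the String-name overloads declare.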

Now consider a web page. Unless something special is done, a form is submitted using the character set declared in the page's Content-Type, and the encoded bytes are sent to the back end. The back end must call req.setCharacterEncoding() to specify the encoding of the parameters (the exact method may differ between application servers) so that they are decoded correctly.
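A servlet container is not needed to see what goes wrong here. The sketch below simulates the two sides with java.net.URLEncoder and java.net.URLDecoder: submit() plays the browser on a GBK page, and receive() plays a back end that assumes some charset. The class and method names are hypothetical, and the sketch assumes the JDK supports GBK. In a real servlet the fix would be to call setCharacterEncoding on the request before reading any parameter.

```java
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingDemo {
    // Browser side: encode the value in the page's charset, then percent-escape the bytes.
    static String submit(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Server side: undo the percent-escaping and decode the bytes in the assumed charset.
    static String receive(String wireValue, String assumedCharset) {
        try {
            return URLDecoder.decode(wireValue, assumedCharset);
        } catch (UnsupportedEncodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String typed = "\u4e2d\u6587";        // what the user typed into the form
        String wire = submit(typed, "GBK");   // %D6%D0%CE%C4

        String right = receive(wire, "GBK");         // back end told the correct charset
        String wrong = receive(wire, "ISO-8859-1");  // back end falls back to ISO-8859-1

        System.out.println(right.equals(typed));  // true
        System.out.println(wrong.equals(typed));  // false: four Latin-1 characters instead
    }
}
```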

In Java, both encode and decode are defined relative to Unicode: encode converts char[] --> byte[] in encoding XXX, and decode converts byte[] in encoding XXX --> char[]. Normally, when we say "convert GBK encoding to UTF-8 encoding", the actual meaning is: GBK-encoded byte[] --> UTF-8-encoded byte[]. This kind of conversion is meaningful only when a byte[] is needed for data transmission or storage; otherwise it is meaningless.
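"Convert GBK to UTF-8" in the byte[] --> byte[] sense always passes through Unicode in the middle. A minimal sketch follows; the class name TranscodeDemo and the method name transcode are hypothetical, and GBK support in the JDK is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeDemo {
    // byte[] in 'from' --decode--> String (Unicode) --encode--> byte[] in 'to'
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String unicode = new String(input, from);
        return unicode.getBytes(to);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");  // assumption: available in this JDK
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        byte[] utf8Bytes = transcode(gbkBytes, gbk, StandardCharsets.UTF_8);

        StringBuilder hex = new StringBuilder();
        for (byte b : utf8Bytes) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim());  // E4 B8 AD E6 96 87
    }
}
```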

The first thing to note is that a String object in Java is always a Unicode string.

However, we often hear someone say, "We need to convert the String from ISO-8859-1 to GBK encoding." What does that mean? In fact, there is no such thing as "converting an ISO-8859-1 encoded string to a GBK encoded string". To repeat: strings in Java are Unicode, so there is no "ISO-8859-1 encoded String" or "GBK encoded String". The only reason to convert is that the String was decoded incorrectly in the first place. We often need to "convert" from ISO-8859-1 to something like GBK or UTF-8, and the so-called conversion is really: String --> byte[] --> String.
You may know the code for this process very well: new String(text.getBytes("ISO-8859-1"), "GBK"). It looks easy to understand: encode the text String as a byte[] in ISO-8859-1, then decode that byte[] into a String as GBK. But this code is easily misread as "converting the text String from ISO-8859-1 encoding to GBK encoding", which is wrong. Have you ever seen code like new String(text.getBytes("GBK"), "UTF-8") used to "convert the encoding" of a String?
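The point that a String carries no encoding of its own can be checked directly. In the sketch below (the class name, the misuse() helper, and the sample bytes are my own illustration, and GBK support in the JDK is assumed), the same text decoded from GBK bytes and from UTF-8 bytes yields equal Strings, while applying the "conversion" idiom to an already correct String damages it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StringHasNoEncodingDemo {
    static final Charset GBK = Charset.forName("GBK");

    // The mistaken "convert a String from GBK to UTF-8" idiom.
    static String misuse(String text) {
        return new String(text.getBytes(GBK), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The same two characters (U+4E2D U+6587) as GBK bytes and as UTF-8 bytes.
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        byte[] utf8Bytes = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD,
                            (byte) 0xE6, (byte) 0x96, (byte) 0x87};

        String fromGbk = new String(gbkBytes, GBK);
        String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);

        // Once decoded, the two Strings are indistinguishable.
        System.out.println(fromGbk.equals(fromUtf8));  // true

        // "Converting" a correct String only corrupts it.
        System.out.println(misuse(fromGbk).equals(fromGbk));  // false
    }
}
```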

Code like new String(text.getBytes("ISO-8859-1"), "GBK") exists because a GBK byte stream was mistakenly decoded into a String (Unicode) as ISO-8859-1. The most common cause is a GBK-encoded web page submitting data to the back end, where the output then shows up as garbled characters: the GBK stream was treated as an ISO-8859-1 stream, so the resulting String is wrong. Since ISO-8859-1 is a single-byte encoding, each byte is converted to one char as is. In other words, although the decoding was wrong, the underlying bytes have not been changed, so we still have a chance to recover them and decode again. That is where the classic new String(text.getBytes("ISO-8859-1"), "GBK") comes from.
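The whole sequence, wrong decode followed by recovery, can be reproduced in a few lines. The class name MojibakeRepairDemo and the helper names are hypothetical, the sample text is mine, and the sketch assumes GBK is supported by the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepairDemo {
    static final Charset GBK = Charset.forName("GBK");

    // The mistake: GBK bytes decoded as if they were ISO-8859-1.
    static String decodeWrongly(byte[] gbkBytes) {
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);
    }

    // The classic repair: recover the original bytes, then decode them as GBK.
    static String repair(String wronglyDecoded) {
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, GBK);
    }

    public static void main(String[] args) {
        String typed = "\u4e2d\u6587";
        byte[] wire = typed.getBytes(GBK);    // what the GBK page actually sent

        String wrong = decodeWrongly(wire);   // four Latin-1 chars, one per byte
        String fixed = repair(wrong);

        System.out.println(wrong.length());       // 4
        System.out.println(fixed.equals(typed));  // true
    }
}
```

The repair works only because ISO-8859-1 maps every byte value to exactly one char, so getBytes(ISO_8859_1) returns the original GBK bytes untouched.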

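If the system mistakenly decodes the bytes as some other encoding, it may not be possible to convert them back at all, because encoding conversion is not as simple as two negatives making a positive: a multi-byte decoder can replace byte sequences it considers invalid, and the original bytes are then lost.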
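Whether a wrong decode can be undone depends on whether the wrongly assumed charset preserves every byte. A rough way to check this, sketched with a hypothetical helper survivesRoundTrip and sample bytes of my own choosing:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyDecodeDemo {
    // True if decoding with the wrong charset and encoding back returns the same bytes.
    static boolean survivesRoundTrip(byte[] original, Charset wrongCharset) {
        String wronglyDecoded = new String(original, wrongCharset);
        byte[] recovered = wronglyDecoded.getBytes(wrongCharset);
        return Arrays.equals(original, recovered);
    }

    public static void main(String[] args) {
        // GBK bytes for U+4E2D U+6587; they are not a valid UTF-8 sequence.
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        // ISO-8859-1 keeps each byte as one char, so nothing is lost.
        System.out.println(survivesRoundTrip(gbkBytes, StandardCharsets.ISO_8859_1));  // true

        // The UTF-8 decoder replaces malformed input with U+FFFD; the bytes are gone.
        System.out.println(survivesRoundTrip(gbkBytes, StandardCharsets.UTF_8));       // false

        // US-ASCII has no mapping for bytes above 0x7F, so it is lossy as well.
        System.out.println(survivesRoundTrip(gbkBytes, StandardCharsets.US_ASCII));    // false
    }
}
```

This is why the getBytes/new String repair is safe only when the mistaken charset was a single-byte one such as ISO-8859-1 that maps all 256 byte values.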
