Java: Converting Unicode to GBK

Last Update:2015-03-02 Source: Internet

Author: User


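We often run into character encoding problems. Java is known as an international language because the string constants in its class files are stored in (modified) UTF-8 and the JVM works with UTF-16 at run time. (As for why the JVM uses UTF-16, I have not read the relevant material, but I guess it is because a Java char is 16 bits wide and UTF-16 is built on 16-bit code units. Note that UTF-16 is one encoding of Unicode rather than Unicode itself, and that characters outside the Basic Multilingual Plane need two chars.)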
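The relationship between char and UTF-16 can be checked directly. The following is a minimal sketch; the class name Utf16Units and the helper unitsFor are made up for illustration, and U+20000 is simply one example of a character outside the Basic Multilingual Plane.

```java
class Utf16Units {
    // Number of UTF-16 code units (Java chars) needed for one code point.
    static int unitsFor(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        String bmp = "\u53d6";                                         // U+53D6, inside the BMP
        String supplementary = new String(Character.toChars(0x20000)); // U+20000, outside the BMP

        System.out.println(Character.SIZE);          // 16: bits in one char
        System.out.println(bmp.length());            // 1: one char is enough
        System.out.println(supplementary.length());  // 2: a surrogate pair
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```

So String.length() counts UTF-16 code units, not characters, which matters whenever text may contain supplementary characters.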

The goal of Unicode is to support all character sets in the world, meaning that almost every character in any other character set has a corresponding code in Unicode. The mapping from characters to numbers is the coded character set, called UCS (Universal Coded Character Set), and the number assigned to each character is called a code point. UTF-8 and UTF-16 are different ways of encoding UCS code points as bytes; UTF stands for UCS Transformation Format.
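To make the difference between a code point and its encodings concrete, the sketch below encodes the single code point U+53D6 in several character sets and prints the resulting bytes in hex. The class and method names are illustrative, and it assumes the JDK in use ships the GBK charset (full JDK distributions normally do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingsOfOneCodePoint {
    // Encode the string with the given charset and render the bytes as hex.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u53d6";
        System.out.printf("code point: U+%04X%n", s.codePointAt(0));          // U+53D6
        System.out.println("UTF-8:    " + hex(s, StandardCharsets.UTF_8));    // E5 8F 96
        System.out.println("UTF-16BE: " + hex(s, StandardCharsets.UTF_16BE)); // 53 D6
        System.out.println("GBK:      " + hex(s, Charset.forName("GBK")));    // two bytes
    }
}
```

The code point stays the same in every line; only the byte representation changes with the encoding.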

In Java, String.getBytes() encodes a string (Unicode) into bytes using a given character set, and new String() decodes a byte sequence in a given character set back into Unicode. Every String in Java is held as Unicode (UTF-16).

Now consider a web page. Without special handling, the browser encodes submitted form data using the character set declared in the page's Content-Type and sends it to the back end. The back end must then state which encoding the parameters use, for example with request.setCharacterEncoding() (different application servers may need this to be specified in different ways), so that they are decoded correctly.
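The effect of the server choosing the right or wrong character set can be simulated without a servlet container. The sketch below is only a stand-in for a real form submission: it uses java.net.URLEncoder and URLDecoder (the Charset overloads require Java 10 or later) instead of the servlet API, and the names submit and receive are hypothetical.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FormDecodingSketch {
    // Roughly what a browser does for a page declared as GBK:
    // percent-encode the GBK bytes of the value.
    static String submit(String value) {
        return URLEncoder.encode(value, Charset.forName("GBK"));
    }

    // Roughly what the server does: decode the percent-encoded bytes
    // with whatever charset it believes the request uses.
    static String receive(String wire, Charset assumed) {
        return URLDecoder.decode(wire, assumed);
    }

    public static void main(String[] args) {
        String wire = submit("\u53d6");
        System.out.println(wire);
        // true: both sides agree on GBK
        System.out.println(receive(wire, Charset.forName("GBK")).equals("\u53d6"));
        // false: the server wrongly assumes ISO-8859-1
        System.out.println(receive(wire, StandardCharsets.ISO_8859_1).equals("\u53d6"));
    }
}
```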

In Java, encode and decode are both relative to Unicode: encoding turns char[] into byte[] in encoding XXX, and decoding turns byte[] in encoding XXX back into char[]. When we say "convert GBK to UTF-8", what we actually mean is converting GBK-encoded byte[] into UTF-8-encoded byte[]. This conversion is meaningful only when the data has to be transferred or stored as byte[]; otherwise there is no point.
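A byte[]-to-byte[] conversion of this kind always passes through a String in the middle. The following is a minimal sketch, assuming the GBK charset is available in the JDK; the class name Transcode and its method are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Transcode {
    // Decode the bytes into a String (Unicode in memory),
    // then encode that String with the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] gbkBytes = "\u53d6".getBytes(gbk);
        byte[] utf8Bytes = transcode(gbkBytes, gbk, StandardCharsets.UTF_8);

        // The same character occupies 2 bytes in GBK and 3 bytes in UTF-8.
        System.out.println(gbkBytes.length + " GBK bytes -> " + utf8Bytes.length + " UTF-8 bytes");
        // true: the result equals a direct UTF-8 encoding of the same text
        System.out.println(Arrays.equals(utf8Bytes, "\u53d6".getBytes(StandardCharsets.UTF_8)));
    }
}
```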

The first point to note is that a String object in Java is always a Unicode string.

Yet we often hear someone say, "We need to convert the String from ISO-8859-1 to GBK." What is going on? In fact, we are not converting "an ISO-8859-1 encoded string" into "a GBK encoded string". As stated repeatedly, a String in Java is Unicode, so there is no such thing as an "ISO-8859-1 encoded String" or a "GBK encoded String". The only reason to convert is that the String was decoded with the wrong character set in the first place. We often need to go from ISO-8859-1 to GBK, UTF-8, and so on. The so-called conversion process is: String -> byte[] -> String.

You probably know the code for this process well: new String(text.getBytes("ISO-8859-1"), "GBK"). Truly understanding it is not so simple, though. On the surface it looks easy: encode the String text as ISO-8859-1 into byte[], then decode those bytes as GBK into a String. But the code is easily misread as "converting text from ISO-8859-1 to GBK", which is wrong. Have you ever seen code like new String(text.getBytes("GBK"), "UTF-8") used to convert a string's encoding? It does no such thing: it encodes the text as GBK and then misreads those bytes as UTF-8.

You will often see new String(text.getBytes("ISO-8859-1"), "GBK") because a GBK byte stream was wrongly decoded into a String (Unicode) as ISO-8859-1. This most commonly happens when a GBK-encoded web page submits data to the back end: the GBK stream is treated as an ISO-8859-1 stream, which yields a wrong string. Because ISO-8859-1 is a single-byte encoding, each byte is turned into one char with the same value. The decoding is wrong, but no byte information is lost, so we can still recover the original bytes and decode them properly. That is why the classic new String(text.getBytes("ISO-8859-1"), "GBK") exists.
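The whole sequence, from the wrong decode to the repair, can be reproduced in a few lines. This is a sketch under the assumption that the bytes on the wire were GBK and were mis-decoded as ISO-8859-1; the class name and the repair helper are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RepairMisdecoded {
    // Undo a wrong ISO-8859-1 decode: recover the original bytes,
    // then decode them as GBK, which is what they really were.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String original = "\u53d6";
        byte[] wire = original.getBytes(Charset.forName("GBK"));        // what the page sent
        String garbled = new String(wire, StandardCharsets.ISO_8859_1); // the mistake

        System.out.println(garbled.length());                 // 2: one char per GBK byte
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The repair works only because ISO-8859-1 maps every byte value to exactly one char and back.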

If the bytes were mistaken for some other encoding, it may be impossible to convert them back at all, because encoding conversion is not as simple as two negatives making a positive.
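Whether a wrong decode can be undone depends on whether the wrong charset preserves every byte. The sketch below checks this for a few charsets; the helper roundTrips is hypothetical, and U+4E2D is just a sample character whose GBK bytes are not valid UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyMisdecode {
    // True if decoding the bytes with the wrong charset and encoding
    // the result with that same charset gives the original bytes back.
    static boolean roundTrips(byte[] bytes, Charset wrong) {
        byte[] back = new String(bytes, wrong).getBytes(wrong);
        return Arrays.equals(bytes, back);
    }

    public static void main(String[] args) {
        byte[] gbkBytes = "\u4e2d".getBytes(Charset.forName("GBK"));

        // ISO-8859-1 keeps every byte, so the mistake is reversible.
        System.out.println(roundTrips(gbkBytes, StandardCharsets.ISO_8859_1));
        // US-ASCII and UTF-8 replace bytes they cannot decode,
        // so the original bytes are gone for good.
        System.out.println(roundTrips(gbkBytes, StandardCharsets.US_ASCII));
        System.out.println(roundTrips(gbkBytes, StandardCharsets.UTF_8));
    }
}
```

Once a decoder has substituted a replacement character for bytes it could not interpret, no later conversion can bring the original data back.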

public class Unicode2GB {
    public static void main(String[] args) {
        String str = "\u53d6";
        // println encodes the string with the platform's default output charset
        System.out.println(str);
    }
}

When printed, the string is automatically encoded in the platform's default charset (a GB encoding on a typical Simplified Chinese system). It is also possible to add an explicit conversion:

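public class Unicode2GB {
    public static void main(String[] args) {
        try {
            String str = "\u53d6";
            // Encode with the platform default charset, then decode those bytes as GB2312.
            // This restores the text only when the default charset is GB2312-compatible.
            str = new String(str.getBytes(), "gb2312");
            System.out.println(str);
        } catch (java.io.UnsupportedEncodingException e) {
            // "gb2312" is not available in this JVM
            e.printStackTrace();
        }
    }
}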
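If the goal is to control which bytes are written, one option is to give the output stream an explicit charset instead of re-decoding the string. The following is a sketch of that alternative, not the original author's approach; it writes into an in-memory buffer so the result can be inspected, and the class and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitOutputEncoding {
    // Print the text through a PrintStream that encodes with the
    // named charset, and return the bytes that were written.
    static byte[] printed(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(printed("\u53d6", "GBK").length);   // 2 bytes when written as GBK
        System.out.println(printed("\u53d6", "UTF-8").length); // 3 bytes when written as UTF-8
    }
}
```

For console output, the same constructor can wrap System.out, for example new PrintStream(System.out, true, "GBK"), provided the terminal really expects GBK.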


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
