Recently I worked on a feedback feature that sends data to another company's website. Our site uses GBK encoding and theirs uses UTF-8, so the GBK-encoded data has to be converted to UTF-8 or it shows up garbled on their side. The simplest approach is to set HttpClient's ContentCharset to UTF-8. If ContentCharset is GBK and you don't want to change it, you have to convert the data to UTF-8 yourself before sending it.
The problem: when converting GBK to UTF-8, an odd number of Chinese characters comes out garbled, while an even number does not.
Three Chinese characters
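For reference, here is a minimal sketch of the conversion itself, assuming the data arrives as GBK-encoded bytes: decode with the charset the bytes are really in, then encode the resulting string as UTF-8. The class and method names are illustrative and not from the original post, and the sketch assumes the JDK in use provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GbkToUtf8Sketch {
    // Assumption: the running JDK provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Decode GBK bytes into a String, then encode that String as UTF-8.
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        String text = new String(gbkBytes, GBK);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes for the three characters 我来了, so the source
        // file compiles the same way under any file encoding.
        byte[] gbkBytes = "\u6211\u6765\u4e86".getBytes(GBK);
        byte[] utf8Bytes = gbkToUtf8(gbkBytes);
        System.out.println(gbkBytes.length + " GBK bytes -> "
                + utf8Bytes.length + " UTF-8 bytes");
    }
}
```

Naming the charset at every step means the result no longer depends on the platform default encoding.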
Java code
- public static void encodeError() throws UnsupportedEncodingException {
-     String gbk = "我来了"; // "I'm here": three Chinese characters
-     // No charset given: decodes with the platform default (GBK here)
-     String utf8 = new String(gbk.getBytes("UTF-8"));
-     // Simulate how a UTF-8 encoded website displays the data
-     System.out.println(new String(utf8.getBytes(), "UTF-8"));
- }
- /*
- 我来??
- */
Three Chinese characters at the front and one at the back: both counts are odd
Java code
- public static void encodeError2() throws UnsupportedEncodingException {
-     String gbk = "今年是2011年"; // "This year is 2011"
-     // No charset given: decodes with the platform default (GBK here)
-     String utf8 = new String(gbk.getBytes("UTF-8"));
-     // Simulate how a UTF-8 encoded website displays the data
-     System.out.println(new String(utf8.getBytes(), "UTF-8"));
- }
- /*
- 今年??011??
- */
Why is only an odd number of Chinese characters garbled, while an even number is fine? Let's analyze the cause.
Java code
- public static void analyze() throws UnsupportedEncodingException {
-     String gbk = "我来了";
-     String utf8 = new String(gbk.getBytes("UTF-8"));
-     // The correct UTF-8 bytes
-     for (byte b : gbk.getBytes("UTF-8")) {
-         System.out.print(b + " ");
-     }
-     System.out.println();
-     // The bytes after a round trip through the default (GBK) charset
-     for (byte b : utf8.getBytes()) {
-         System.out.print(b + " ");
-     }
- }
- /*
- -26 -120 -111 -26 -99 -91 -28 -70 -122
- -26 -120 -111 -26 -99 -91 -28 -70 63
- */
Note that only the last byte differs, and the first line is the correct UTF-8 encoding. So why is the last byte of the second line 63 rather than -122? This is the cause of the garbled characters.
In GBK a Chinese character takes 2 bytes, while in UTF-8 a common Chinese character takes 3 bytes. When we call getBytes("UTF-8"), the extra byte is produced by computation: each character that would be 2 bytes in GBK becomes its 3-byte UTF-8 form. That is why the 3 Chinese characters in the previous example produce 9 bytes.
Here is how the extra bytes are computed; readers who don't want to dig into the details can skip this part. To make it stand out, the explanation is given directly in code.
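The byte counts are easy to check. The sketch below is illustrative only (the class and method names are not from the original post) and assumes the JDK provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengthSketch {
    // Number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u6211\u6765\u4e86"; // the three characters 我来了
        System.out.println("chars: " + text.length());                                    // 3
        System.out.println("GBK:   " + encodedLength(text, Charset.forName("GBK")));      // 6
        System.out.println("UTF-8: " + encodedLength(text, StandardCharsets.UTF_8));      // 9
    }
}
```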
Java code
- public static void gbk2Utf() throws UnsupportedEncodingException {
-     String gbk = "我来了";
-     char[] c = gbk.toCharArray();
-     byte[] fullByte = new byte[3 * c.length];
-     for (int i = 0; i < c.length; i++) {
-         String binary = Integer.toBinaryString(c[i]);
-         StringBuffer sb = new StringBuffer();
-         int len = 16 - binary.length();
-         // Pad with leading zeros to 16 bits
-         for (int j = 0; j < len; j++) {
-             sb.append("0");
-         }
-         sb.append(binary);
-         // Insert the UTF-8 marker bits to reach 24 bits, i.e. 3 bytes
-         // (valid for characters in the range U+0800 to U+FFFF)
-         sb.insert(0, "1110");
-         sb.insert(8, "10");
-         sb.insert(16, "10");
-         // Build each byte from its 8-character binary string
-         fullByte[i * 3] = Integer.valueOf(sb.substring(0, 8), 2).byteValue();
-         fullByte[i * 3 + 1] = Integer.valueOf(sb.substring(8, 16), 2).byteValue();
-         fullByte[i * 3 + 2] = Integer.valueOf(sb.substring(16, 24), 2).byteValue();
-     }
-     // Simulate how a UTF-8 encoded website displays the data
-     System.out.println(new String(fullByte, "UTF-8"));
- }
Now let's find out why the last byte is 63 rather than -122.
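The same calculation can be written with shifts and masks instead of binary strings. This is a sketch under the same assumption as the code above: every character lies in the range U+0800 to U+FFFF, so each one needs exactly 3 bytes. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ThreeByteSketch {
    // Encodes each char in U+0800..U+FFFF (surrogates excluded) as 3 UTF-8 bytes.
    static byte[] encode(String text) {
        byte[] out = new byte[3 * text.length()];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x0800 || Character.isSurrogate(c)) {
                throw new IllegalArgumentException("not a 3-byte character: " + (int) c);
            }
            out[3 * i]     = (byte) (0xE0 | (c >> 12));         // 1110xxxx: top 4 bits
            out[3 * i + 1] = (byte) (0x80 | ((c >> 6) & 0x3F)); // 10xxxxxx: middle 6 bits
            out[3 * i + 2] = (byte) (0x80 | (c & 0x3F));        // 10xxxxxx: low 6 bits
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\u6211\u6765\u4e86"; // the three characters 我来了
        byte[] manual = encode(text);
        byte[] builtIn = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(manual));
        System.out.println("matches getBytes: " + Arrays.equals(manual, builtIn));
    }
}
```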
Java code
- public static void analyze2() throws UnsupportedEncodingException {
-     String gbk = "我来了";
-     byte[] utfBytes = gbk.getBytes("UTF-8");
-     String utf8 = new String(utfBytes); // the problem is here
-     System.out.print(utf8);
- }
- /*
- 鎴戞潵浜?
- */
Because the platform default charset here is GBK (the project is GBK encoded), new String(utfBytes) is equivalent to new String(utfBytes, "GBK"). It decodes the bytes two at a time, so when the number of bytes is odd, the final single byte cannot be decoded. The decoder substitutes a replacement character for it, and that character comes back out as '?', ASCII code 63, when the string is encoded again.
Solving the problem
What matters is keeping the bytes intact. After calling getBytes("UTF-8") to get the byte array, create the string with ISO-8859-1 encoding instead. ISO-8859-1 maps every byte to exactly one character, so the last byte is not corrupted.
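To reproduce this without depending on the platform default, the charset can be named explicitly. The sketch below is illustrative (the names are not from the original post) and assumes the JDK provides the GBK charset. An even number of Chinese characters survives only as long as every byte pair happens to be a valid GBK sequence, which is the case for the two-character sample used here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GbkRoundTripSketch {
    static final Charset GBK = Charset.forName("GBK");

    // UTF-8 bytes, wrongly decoded as GBK, then encoded back with GBK.
    static byte[] roundTripThroughGbk(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(utf8Bytes, GBK);
        return misdecoded.getBytes(GBK);
    }

    // True if the UTF-8 bytes come back unchanged from the round trip.
    static boolean survives(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8),
                roundTripThroughGbk(text));
    }

    public static void main(String[] args) {
        String odd = "\u6211\u6765\u4e86"; // 我来了: 9 UTF-8 bytes
        String even = "\u6211\u6765";      // 我来: 6 UTF-8 bytes
        System.out.println(Arrays.toString(roundTripThroughGbk(odd)));
        System.out.println("odd survives:  " + survives(odd));
        System.out.println("even survives: " + survives(even));
    }
}
```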
Java code
- public static void correctEncode() throws UnsupportedEncodingException {
-     String gbk = "我来了";
-     // Carry the UTF-8 bytes in an ISO-8859-1 string: one byte per character
-     String iso = new String(gbk.getBytes("UTF-8"), "ISO-8859-1");
-     for (byte b : iso.getBytes("ISO-8859-1")) {
-         System.out.print(b + " ");
-     }
-     System.out.println();
-     // Simulate how a UTF-8 encoded website displays the data
-     System.out.println(new String(iso.getBytes("ISO-8859-1"), "UTF-8"));
- }
- /*
- -26 -120 -111 -26 -99 -91 -28 -70 -122
- 我来了
- */
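The same idea can be sketched with the StandardCharsets constants, which avoid the checked UnsupportedEncodingException. The helper names wrap and unwrap are hypothetical and not part of the original post.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1CarrierSketch {
    // Store arbitrary bytes in a String, one char per byte, without loss.
    static String wrap(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Recover the exact bytes from a String produced by wrap().
    static byte[] unwrap(String carrier) {
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "\u6211\u6765\u4e86"; // the three characters 我来了
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        String carrier = wrap(utf8Bytes);
        byte[] recovered = unwrap(carrier);
        System.out.println(Arrays.toString(recovered));
        System.out.println("text restored: "
                + new String(recovered, StandardCharsets.UTF_8).equals(text));
    }
}
```

The carrier string is not readable text; it is only a container for the bytes. Where the API allows it, passing the byte array itself avoids the detour entirely.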
http://www.iteye.com/topic/1097560
Understanding and fixing garbled odd-length Chinese text when converting GBK to UTF-8 (repost)