Java String.getBytes() encoding problem: String.getBytes(charset)

Source: Internet
Author: User



It is well known that String's getBytes() method returns the contents of a string as a byte array. What is easy to overlook is that the no-argument form encodes with the platform's default charset, which traditionally comes from the operating system. If you ignore this, a system that runs fine on one platform can show unexpected problems when it is moved to another machine. For example, take the following program:


class TestCharset {
    public static void main(String[] args) {
        new TestCharset().execute();
    }

    private void execute() {
        // The final "！" is the full-width exclamation mark, not ASCII "!"
        String s = "Hello!你好！";
        // No charset argument: the platform default encoding is used
        byte[] bytes = s.getBytes();
        System.out.println("bytes length is: " + bytes.length);
    }
}



On a Chinese Windows XP system, the result is:


bytes length is: 12


But if you run it in an English-language UNIX environment:


$ java TestCharset
bytes length is: 9


If your program depends on that result, later operations will go wrong. Why is the result 12 on one system and 9 on the other? As mentioned above, the result of this method depends on the platform (its encoding).



On a Chinese operating system, getBytes() returns a byte array in a Chinese encoding, GBK or GB2312, in which each Chinese character takes two bytes. On an English platform the default encoding is typically "ISO-8859-1", in which every character takes only one byte, whether or not it is a Latin character. A character that ISO-8859-1 cannot represent is replaced by a single "?" byte.
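The platform dependence described above can be checked directly. The following is a minimal sketch rather than code from the original article; the class name DefaultCharsetDemo and the helper lengthIn are made up for illustration. It prints the default charset of the running JVM and compares the no-argument getBytes() with explicit charsets. The Chinese characters are written as \u escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // "Hello!" followed by U+4F60, U+597D and the full-width exclamation mark U+FF01
    static final String TEXT = "Hello!\u4f60\u597d\uff01";

    // Hypothetical helper: number of bytes TEXT occupies in the given charset.
    static int lengthIn(Charset charset) {
        return TEXT.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // These two lines depend on the JVM and operating system settings.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes():      " + TEXT.getBytes().length);

        // Explicit charsets give the same answer on every machine.
        System.out.println("ISO-8859-1:      " + lengthIn(StandardCharsets.ISO_8859_1)); // 9
        System.out.println("UTF-8:           " + lengthIn(StandardCharsets.UTF_8));      // 15
    }
}
```

Only the last two lines of output are stable across machines, which is the reason to pass a charset explicitly.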



Encoding support in Java



Java supports many national encodings. Inside Java, characters are stored as Unicode; for example, the Unicode code of the character "你" is 4f60. We can verify this with the following experimental code:



class TestCharset {
    public static void main(String[] args) {
        char c = '你';
        int i = c;
        System.out.println(c);
        System.out.println(i);
    }
}




Whichever platform you run it on, you get the same output:





你
20320


20320 is the integer value of the Unicode code 4f60. In fact, if you decompile the class above, you will find that the character "你" (or any other Chinese string) is itself stored in Unicode form in the generated .class file:





char c = '\u4f60'; ... ...


Even if you specify the encoding of the source file at compile time, for example:





javac -encoding GBK TestCharset.java


The .class file generated by compilation still stores Chinese characters and strings in Unicode form.



Using the String.getBytes(String charset) method



So, to avoid this problem, I recommend using the String.getBytes(String charset) method in your code. Below we extract byte arrays in two encodings, ISO-8859-1 and GBK, from the same string to see what happens:


 
 
package org.bruce.file.handle.experiment;  
  
class TestCharset3 {  
    public static void main(String[] args) {  
        new TestCharset3().execute();  
    }  
  
    private void execute() {  
        String s = "Hello!你好！";  // the last character is the full-width "！"
        byte[] bytesISO8859 = null;  
        byte[] bytesGBK = null;  
        try {  
            bytesISO8859 = s.getBytes("iso-8859-1");  
            bytesGBK = s.getBytes("GBK");  
        } catch (java.io.UnsupportedEncodingException e) {  
            e.printStackTrace();  
        }  
        System.out.println("-------------- \n 8859 bytes:");
        System.out.println("bytes is: " + arrayToString(bytesISO8859));  
        System.out.println("hex format is:" + encodeHex(bytesISO8859));  
        System.out.println();  
        System.out.println("-------------- \n GBK bytes:");
        System.out.println("bytes is: " + arrayToString(bytesGBK));  
        System.out.println("hex format is:" + encodeHex(bytesGBK));  
    }  
  
    public static final String encodeHex(byte[] bytes) {  
        StringBuffer buff = new StringBuffer(bytes.length * 2);  
        String b;  
        for (int i = 0; i < bytes.length; i++) {  
            b = Integer.toHexString(bytes[i]);  
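            // A byte prints as two hex digits, but Integer.toHexString widens a negative
            // byte to a 4-byte int (8 hex digits), so keep only the last two digits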
            buff.append(b.length() > 2 ? b.substring(6, 8) : b);  
            buff.append(" ");  
        }  
        return buff.toString();  
    }  
  
    public static final String arrayToString(byte[] bytes) {  
        StringBuffer buff = new StringBuffer();  
        for (int i = 0; i < bytes.length; i++) {  
            buff.append(bytes[i] + " ");  
        }  
        return buff.toString();  
    }  
}  





Executing the above program will print out:





--------------
 8859 bytes:
bytes is: 72 101 108 108 111 33 63 63 63
hex format is:48 65 6c 6c 6f 21 3f 3f 3f

--------------
 GBK bytes:
bytes is: 72 101 108 108 111 33 -60 -29 -70 -61 -93 -95
hex format is:48 65 6c 6c 6f 21 c4 e3 ba c3 a3 a1


As you can see, the ISO-8859-1 byte array extracted from s has length 9, and the Chinese characters have all become 63. ASCII 63 is "?". This is why some foreign programs, when run in a Chinese environment, often show garbled text full of "?": the encoding was not handled correctly.
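Because getBytes substitutes "?" silently, it can help to check beforehand whether a string fits into the target charset. The following is a minimal sketch using CharsetEncoder.canEncode from java.nio.charset; the class name EncodableCheck and the helper fitsIn are hypothetical names chosen for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodableCheck {
    // Returns true when every character of text can be represented in charset.
    static boolean fitsIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Same text as the example above, written with \u escapes
        String s = "Hello!\u4f60\u597d\uff01";

        System.out.println(fitsIn(s, StandardCharsets.ISO_8859_1)); // false
        System.out.println(fitsIn(s, StandardCharsets.UTF_8));      // true

        // getBytes replaces each unmappable character with '?' (byte value 63).
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(latin1[latin1.length - 1]);              // 63
    }
}
```

A check like this lets a program reject or re-route text before information is lost, instead of discovering the "?" characters later.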



Extracting the GBK byte array yields the correct GBK encoding of the Chinese characters. The GBK codes of the characters "你", "好" and "！" are C4E3, BAC3 and A3A1 respectively. Once you have a correctly GBK-encoded byte array and later need to turn it back into a Chinese string, you can use the following constructor:





new String(byte[] bytes, String charset)
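As an illustration of that constructor, here is a minimal sketch that decodes the six GBK bytes quoted above back into a string. It assumes the running JVM provides the GBK charset (standard JDK builds normally do), and the names GbkRoundTrip and decodeGbk are hypothetical. It uses the overload that takes a Charset object, which avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.Charset;

public class GbkRoundTrip {
    // Assumption: GBK is available in this JVM; forName throws otherwise.
    static final Charset GBK = Charset.forName("GBK");

    // Decodes GBK-encoded bytes into a Java string.
    static String decodeGbk(byte[] bytes) {
        return new String(bytes, GBK);
    }

    public static void main(String[] args) {
        // The GBK codes C4E3, BAC3 and A3A1, one byte at a time
        byte[] gbk = {
            (byte) 0xC4, (byte) 0xE3,
            (byte) 0xBA, (byte) 0xC3,
            (byte) 0xA3, (byte) 0xA1
        };
        String text = decodeGbk(gbk);

        System.out.println(text.length());                      // 3
        System.out.println(text.equals("\u4f60\u597d\uff01"));  // true
    }
}
```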






Older versions of MySQL did not support Unicode, which makes things more troublesome.
Set the encoding in the connection string to gb2312:
String connectionString
= "jdbc:mysql://localhost/test?useUnicode=true&characterEncoding=gb2312";
  
Test code:
String str = "汉字";
PreparedStatement pstmt = conn.prepareStatement("INSERT INTO Test VALUES (?)");
pstmt.setString(1, str);
pstmt.executeUpdate();
  
Database table:
CREATE TABLE Test (
Name CHAR(10)
)
  
  
Connect to Oracle Database Server
-------------------------------------------------------------------------------
Before inserting a string into the database, convert it as follows:
new String(str.getBytes("ISO8859_1"), "gb2312")
  
Test code:
String str = "汉字";
PreparedStatement pstmt = conn.prepareStatement("INSERT INTO Test VALUES (?)");
pstmt.setString(1, new String(str.getBytes("ISO8859_1"), "gb2312"));
pstmt.executeUpdate();
  
  
Servlet
-------------------------------------------------------------------------------
Add two statements at the beginning of the servlet:
response.setContentType("text/html;charset=utf-8");
request.setCharacterEncoding("UTF-8");
  
Jsp
-------------------------------------------------------------------------------




@Test
public void testBytes() {
    // Number of bytes per character
    // Chinese character: ISO: 1, GBK: 2, UTF-8: 3
    // Digit or letter:   ISO: 1, GBK: 1, UTF-8: 1
    String username = "中";
    try {
        // Get the byte array in the specified encoding: string ---> byte array
        byte[] u_iso = username.getBytes("iso8859-1");
        byte[] u_gbk = username.getBytes("GBK");
        byte[] u_utf8 = username.getBytes("UTF-8");
        System.out.println(u_iso.length);
        System.out.println(u_gbk.length);
        System.out.println(u_utf8.length);
        // The reverse direction: byte array ---> string
        String un_iso = new String(u_iso, "iso8859-1");
        String un_gbk = new String(u_gbk, "GBK");
        String un_utf8 = new String(u_utf8, "UTF-8");
        System.out.println(un_iso);
        System.out.println(un_gbk);
        System.out.println(un_utf8);
        // Sometimes the string must be carried as ISO-8859-1; handle it as follows
        String un_utf8_iso = new String(u_utf8, "iso8859-1");
        // Restore the original text from the ISO-encoded string
        String un_iso_utf8 = new String(un_utf8_iso.getBytes("iso8859-1"), "UTF-8");
        System.out.println(un_iso_utf8);
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}


Test results:






1
2
3
?
中
中
ĸ
中
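On Java 7 and later, the same byte counts can be obtained without the try/catch block by passing StandardCharsets constants instead of charset names. The following is a minimal sketch under that assumption; the class name ByteCounts and the helper count are hypothetical, and GBK is left out because StandardCharsets only covers the charsets every JVM must support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCounts {
    // Number of bytes text occupies when encoded with charset.
    static int count(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the character used in the test above
        String letter = "A";

        System.out.println(count(zhong, StandardCharsets.ISO_8859_1)); // 1 (a single '?')
        System.out.println(count(zhong, StandardCharsets.UTF_8));      // 3
        System.out.println(count(letter, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(count(letter, StandardCharsets.UTF_8));      // 1
    }
}
```

Since these constants always exist, getBytes(Charset) declares no checked exception, which keeps calling code shorter.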



Excerpt from the reposted article:



Why the text is garbled: why can the character "中" not be restored after it has been encoded with iso8859-1 and then reassembled? The reason is simple. The iso8859-1 code table contains no Chinese characters, so "中".getBytes("iso8859-1") cannot produce a correct encoded value for "中", and restoring it afterwards with new String() is out of the question.



Sometimes Chinese characters have to satisfy special requirements (for example, HTTP headers require their content to be iso8859-1 encoded). In that case the characters can be carried byte by byte, for example:
String s_iso88591 = new String("中".getBytes("UTF-8"), "iso8859-1"). The resulting s_iso88591 string is really three iso8859-1 characters. After these characters have been passed to the destination, the receiving program reverses the process with String s_utf8 = new String(s_iso88591.getBytes("iso8859-1"), "UTF-8") to recover the correct Chinese character "中". This both complies with the protocol and supports Chinese.
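The two conversions described above can be wrapped in a pair of helpers. This is a minimal sketch, not code from the original article; the class name Latin1Carrier and the method names pack and unpack are hypothetical. It works because iso8859-1 maps every byte value from 0 to 255 to exactly one character, so no byte is lost on the way.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Carrier {
    // Reinterprets the UTF-8 bytes of text as ISO-8859-1 characters.
    static String pack(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Recovers the original bytes and decodes them as UTF-8.
    static String unpack(String carried) {
        return new String(carried.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d"; // the character from the text above
        String carried = pack(original);

        System.out.println(carried.length());                 // 3
        System.out.println(unpack(carried).equals(original)); // true
    }
}
```

Both sides have to agree on the inner encoding (UTF-8 here); unpacking with a different charset produces garbled text again.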



Reposted from: http://blog.csdn.net/yang3wei/article/details/7381578 http://blog.csdn.net/techbirds_bao/article/details/8555449


