In-depth analysis of the getBytes() encoding problem in Java String

Source: Internet
Author: User

If you reprint this article, please credit the source: http://www.cnblogs.com/Joanna-Yan/p/6900536.html

When a Java server back end communicated with an Android app, the two sides produced different MD5 digests for the same string whenever that string contained Chinese characters.

Problem description:

The Java server and the Android side (Android Studio) use the same MD5 utility class, and the default encoding of both projects is UTF-8. For strings containing only English letters and digits the results match, but hashing the same string containing Chinese characters gives different results. Why?

The MD5Util utility class is as follows:

import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MD5Util {

    /**
     * Converts a byte array into a hexadecimal string.
     *
     * @param bytes the bytes to convert
     * @return the hex representation, two digits per byte
     */
    public static String convertByteToHexString(byte[] bytes) {
        String result = "";
        for (int i = 0; i < bytes.length; i++) {
            int temp = bytes[i] & 0xff; // treat the byte as unsigned
            String tempHex = Integer.toHexString(temp);
            if (tempHex.length() < 2) {
                result += "0" + tempHex; // pad single digits to two
            } else {
                result += tempHex;
            }
        }
        return result;
    }

    /**
     * Computes the MD5 digest of a message.
     *
     * @param message the text to hash
     * @return the digest as a hex string
     * @throws UnsupportedEncodingException
     */
    public static String md5Jdk(String message) throws UnsupportedEncodingException {
        String temp = "";
        try {
            MessageDigest md5Digest = MessageDigest.getInstance("MD5");
            // getBytes() without an argument uses the platform default encoding
            byte[] encodeMD5Digest = md5Digest.digest(message.getBytes());
            temp = convertByteToHexString(encodeMD5Digest);
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
        return temp;
    }
}

The root cause turned out to be:

The default encoding configured for the IDE project is not the same as the default encoding of the runtime environment on the platform.

Changing the call to message.getBytes("UTF-8") resolves the inconsistent results for strings containing Chinese characters.
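The fix can be sketched as a small self-contained program. This is only an illustration, not the original project code: the class name Md5CharsetSketch and the helper md5Hex are made up for this example, and the charset is passed as a java.nio.charset.Charset rather than by name. The sample text is written with \u escapes (U+4F60 U+597D) so that the source file's own encoding cannot affect the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: MD5 over bytes produced with an explicit charset.
public class Md5CharsetSketch {

    /** Encodes the message with the given charset, then returns its MD5 digest as hex. */
    public static String md5Hex(String message, Charset charset) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(message.getBytes(charset));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Standard JDKs ship an MD5 provider, so this is not expected in practice.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String chinese = "\u4F60\u597D"; // two Chinese characters, U+4F60 U+597D

        // Pure ASCII text encodes to the same bytes in both charsets, so the digests match.
        System.out.println(md5Hex(ascii, StandardCharsets.UTF_8)
                .equals(md5Hex(ascii, StandardCharsets.ISO_8859_1)));

        // Chinese text encodes to different bytes, so the digests differ.
        System.out.println(md5Hex(chinese, StandardCharsets.UTF_8)
                .equals(md5Hex(chinese, StandardCharsets.UTF_16BE)));
    }
}
```

When both the server and the client call something like md5Hex(text, StandardCharsets.UTF_8), the digest no longer depends on either side's default encoding.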

Next, let's take a closer look at the getBytes() method of String.

String.getBytes() returns the string as a byte array, but note that it encodes the string with the platform's default encoding, which normally follows the operating system settings. If you overlook this, you will find that a system that runs well on one platform produces unexpected problems when it is moved to another machine.
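To see which encoding the no-argument call uses on a given machine, a minimal sketch such as the following can help. The class name DefaultCharsetSketch and its helper are hypothetical. Note that on JDK 18 and later the default charset is UTF-8 unless it is overridden (JEP 400), so the platform dependence described here mainly affects older runtimes.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical sketch: shows that getBytes() follows Charset.defaultCharset().
public class DefaultCharsetSketch {

    /** Returns true when getBytes() and getBytes(defaultCharset()) give identical bytes. */
    public static boolean noArgUsesDefaultCharset(String text) {
        byte[] implicit = text.getBytes();
        byte[] explicit = text.getBytes(Charset.defaultCharset());
        return Arrays.equals(implicit, explicit);
    }

    public static void main(String[] args) {
        // Both values vary from machine to machine and between JDK versions.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // "Hello!" followed by U+4F60, U+597D and a full-width exclamation mark (U+FF01).
        System.out.println(noArgUsesDefaultCharset("Hello!\u4F60\u597D\uFF01"));
    }
}
```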

For example, the following program:

public class TestCharset {

    public static void main(String[] args) {
        new TestCharset().execute();
    }

    private void execute() {
        String s = "Hello!你好！";
        // No charset argument: the platform default encoding is used.
        byte[] bytes = s.getBytes();
        System.out.println("bytes length is: " + bytes.length);
    }
}

On a Chinese Windows XP system, the result is:

bytes length is: 12

However, if you run it in an English UNIX environment:

$ java TestCharset
bytes length is: 9

If your program depends on this result, it will cause problems in subsequent operations. Why is the result 12 on one system but 9 on the other? As mentioned above, the result of this method depends on the platform's default encoding.

On a Chinese operating system, getBytes() returns a byte array in the GBK or GB2312 Chinese encoding, in which each Chinese character occupies two bytes. On an English platform, the default encoding is typically ISO-8859-1, in which every character takes exactly one byte, whether or not it is a Latin character.
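The byte counts can be checked directly by naming the encoding instead of relying on the platform. The sketch below is illustrative only (the class name ByteLengthSketch and the helper encodedLength are made up here). It uses the string "Hello!你好！", written with \u escapes, and guards the GBK case because that charset is not guaranteed to be present on every Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the same string has a different byte length in each encoding.
public class ByteLengthSketch {

    /** Number of bytes the text occupies when encoded with the given charset. */
    public static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "Hello!" plus U+4F60, U+597D and a full-width exclamation mark (U+FF01).
        String s = "Hello!\u4F60\u597D\uFF01";

        System.out.println("chars:      " + s.length());                                     // 9
        System.out.println("ISO-8859-1: " + encodedLength(s, StandardCharsets.ISO_8859_1)); // 9
        System.out.println("UTF-8:      " + encodedLength(s, StandardCharsets.UTF_8));      // 15

        // GBK is an optional charset, so check before using it.
        if (Charset.isSupported("GBK")) {
            System.out.println("GBK:        " + encodedLength(s, Charset.forName("GBK")));  // 12
        }
    }
}
```

The 12 and 9 reported earlier correspond to the GBK and ISO-8859-1 rows. UTF-8 gives a third value because each of these three characters takes three bytes there.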

Java encoding support

Java supports encodings for many languages. Inside Java, all characters are stored as Unicode. For example, the Unicode value of "你" ("you") is 4f60 in hexadecimal, which we can verify with the following experiment:

public class TestCharset {

    public static void main(String[] args) {
        char c = '你';
        int i = c; // widening a char to int yields its Unicode value
        System.out.println(c);
        System.out.println(i);
    }
}

No matter which platform you run it on, the integer printed is the same:

20320

20320 is the decimal value of the hexadecimal Unicode value 4f60. In fact, if you decompile the class above, you will find that the character "你" (or any other Chinese string) in the generated .class file is stored in Unicode form:

char c = '\u4F60';
...

Even if you specify the source encoding at compile time, for example:

javac -encoding GBK TestCharset.java

The .class file generated by the compiler still stores Chinese characters and strings in Unicode form. The -encoding option only tells javac how to read the source file.

So, to avoid this problem, we recommend always using the String.getBytes(String charset) method and naming the encoding explicitly.
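Besides the overload that takes a charset name, String also has getBytes(Charset) (since Java 6), and java.nio.charset.StandardCharsets (since Java 7) provides constants for the charsets every Java runtime must support. The sketch below compares the two for the single character U+4F60. The class name ExplicitCharsetSketch and the toHex helper are illustrative only.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: two ways of naming the encoding explicitly.
public class ExplicitCharsetSketch {

    /** Formats bytes as space-separated, two-digit lowercase hex. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String ni = "\u4F60"; // the character discussed above, U+4F60

        // Name-based overload: declares the checked UnsupportedEncodingException.
        System.out.println(toHex(ni.getBytes("UTF-8")));                // e4 bd a0

        // Charset-based overload: no checked exception, no typo-prone string.
        System.out.println(toHex(ni.getBytes(StandardCharsets.UTF_8))); // e4 bd a0
    }
}
```

Either form removes the dependence on the platform default. The Charset form is often more convenient because a misspelled name fails at compile time rather than at run time.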

Next we will extract the byte arrays of the ISO-8859-1 and GBK encoding formats from the string to see what will happen:

import java.io.UnsupportedEncodingException;

public class TestCharset {

    public static void main(String[] args) {
        new TestCharset().execute();
    }

    private void execute() {
        String s = "Hello!你好！";
        byte[] bytesISO8859 = null;
        byte[] bytesGBK = null;
        try {
            bytesISO8859 = s.getBytes("iso-8859-1");
            bytesGBK = s.getBytes("GBK");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        System.out.println("--------------\n8859 bytes:");
        System.out.println("bytes is: " + arrayToString(bytesISO8859));
        System.out.println("hex format is: " + encodeHex(bytesISO8859));
        System.out.println();
        System.out.println("--------------\nGBK bytes:");
        System.out.println("bytes is: " + arrayToString(bytesGBK));
        System.out.println("hex format is: " + encodeHex(bytesGBK));
    }

    public static final String encodeHex(byte[] bytes) {
        StringBuffer buff = new StringBuffer(bytes.length * 2);
        String b;
        for (int i = 0; i < bytes.length; i++) {
            // A byte is 8 bits; Integer.toHexString widens it to a 32-bit int,
            // so a negative byte turns into eight hex digits such as "ffffffc4".
            b = Integer.toHexString(bytes[i]);
            // In that case keep only the last two hex digits.
            buff.append(b.length() > 2 ? b.substring(6, 8) : b);
            buff.append(" ");
        }
        return buff.toString();
    }

    public static final String arrayToString(byte[] bytes) {
        StringBuffer buff = new StringBuffer();
        for (int i = 0; i < bytes.length; i++) {
            buff.append(bytes[i] + " ");
        }
        return buff.toString();
    }
}

Execution result:

--------------
8859 bytes:
bytes is: 72 101 108 108 111 33 63 63 63
hex format is: 48 65 6c 6c 6f 21 3f 3f 3f

--------------
GBK bytes:
bytes is: 72 101 108 108 111 33 -60 -29 -70 -61 -93 -95
hex format is: 48 65 6c 6c 6f 21 c4 e3 ba c3 a3 a1

As you can see, the ISO-8859-1 byte array extracted from s has length 9, and every Chinese character has become 63, which is the ASCII code for "?". This is why programs written abroad often show garbled characters when run in a Chinese environment: the encoding is not handled correctly.
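Because getBytes silently substitutes "?" for characters the target encoding cannot represent, it can be useful to detect the loss before it happens. One possible approach, sketched here with hypothetical names (LossyEncodingSketch, canRepresent, encodeStrict), is to use java.nio.charset.CharsetEncoder, which reports unmappable characters instead of replacing them when used directly.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: detect characters that an encoding cannot represent.
public class LossyEncodingSketch {

    /** True if every character of the text can be encoded with the charset. */
    public static boolean canRepresent(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    /** Encodes the text, throwing instead of substituting '?' for unmappable characters. */
    public static byte[] encodeStrict(String text, Charset charset)
            throws CharacterCodingException {
        // A freshly created encoder reports unmappable characters by default.
        ByteBuffer buffer = charset.newEncoder().encode(CharBuffer.wrap(text));
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) {
        String s = "Hello!\u4F60\u597D\uFF01";
        System.out.println(canRepresent("Hello!", StandardCharsets.ISO_8859_1)); // true
        System.out.println(canRepresent(s, StandardCharsets.ISO_8859_1));        // false
        try {
            encodeStrict(s, StandardCharsets.ISO_8859_1);
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode losslessly: " + e);
        }
    }
}
```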

The GBK byte array, by contrast, contains the correct GBK encoding of the Chinese characters: "你", "好", and "！" are encoded as c4e3, bac3, and a3a1 respectively. Given a correctly encoded GBK byte array, you can restore it to a Chinese string with the following constructor:

new String(byte[] bytes, String charset)
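A round trip makes the rule concrete: the bytes come back as the original string only when they are decoded with the same encoding that produced them. The sketch below uses UTF-8 rather than GBK so that it runs on any Java runtime, and it uses the Charset-based constructor new String(byte[], Charset). The class name RoundTripSketch and the helper roundTrip are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: encode a string, then decode the bytes again.
public class RoundTripSketch {

    /** Encodes the text with one charset and decodes the resulting bytes with another. */
    public static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // "Hello!" plus U+4F60, U+597D and a full-width exclamation mark (U+FF01).
        String s = "Hello!\u4F60\u597D\uFF01";

        // Same charset on both sides: the original string is restored.
        System.out.println(
                roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(s));      // true

        // Mismatched charsets: the bytes are misread and the text is garbled.
        System.out.println(
                roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).equals(s)); // false

        // An encoding that cannot represent the characters loses them for good.
        System.out.println(
                roundTrip(s, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));      // Hello!???
    }
}
```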

If this article is helpful to you, please let me know ~

 
