String.length() and String.getBytes().length

Source: Internet
Author: User

String.length()

Returns the number of characters in a string. One Chinese character counts as one character.

String.getBytes().length

Returns the length of the string in bytes. In a Chinese encoding such as GBK, one Chinese character occupies two bytes.
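As a minimal sketch of this difference, the following compares the character count of a string with its byte count under a few explicitly named encodings. The class name LengthVsBytes and the helper byteCount are hypothetical names chosen for illustration, and the sketch assumes the JDK in use provides the GBK charset. The Chinese characters are written as Unicode escapes so that the result does not depend on how the source file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LengthVsBytes {
    // "Hello!" followed by three full-width Chinese characters,
    // written as Unicode escapes to stay independent of the source file encoding.
    static final String SAMPLE = "Hello!\u4f60\u597d\uff01";

    // Hypothetical helper: number of bytes the string occupies in the given charset.
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Assumption: the GBK charset is available in this JDK.
        Charset gbk = Charset.forName("GBK");

        System.out.println("chars:            " + SAMPLE.length());                                  // 9
        System.out.println("GBK bytes:        " + byteCount(SAMPLE, gbk));                           // 12
        System.out.println("UTF-8 bytes:      " + byteCount(SAMPLE, StandardCharsets.UTF_8));        // 15
        System.out.println("ISO-8859-1 bytes: " + byteCount(SAMPLE, StandardCharsets.ISO_8859_1));   // 9
    }
}
```

The character count is always 9, while the byte count depends entirely on which encoding is named.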

The getBytes() method of String returns the byte array of a string, as is well known. However, when called with no argument, it returns the bytes in the operating system's default encoding. If you do not take this into account, a system that runs well on one platform may show unexpected problems when it is moved to another machine. For example, consider the following program:

class TestCharset
{

    public static void main(String[] args)
    {
        new TestCharset().execute();
    }

    private void execute() {
        String s = "Hello!你好！";

        // No charset argument: the platform's default encoding is used
        byte[] bytes = s.getBytes();

        System.out.println("bytes length is: " + bytes.length);
    }

}

On a Chinese Windows XP system, the result is:

bytes length is: 12

However, if you run it in an English Unix environment:

$ java TestCharset
bytes length is: 9

If your program depends on this result, later operations will go wrong. Why is the result 12 on one system and 9 on the other? As mentioned above, the result of this method depends on the platform's default encoding.

On a Chinese operating system, getBytes() returns a byte array encoded in GBK or GB2312, in which each Chinese character occupies two bytes. On an English platform, the default encoding is typically ISO-8859-1, in which every character is encoded as a single byte, whether or not it is a Latin character (characters that ISO-8859-1 cannot represent are replaced with "?").
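To see which encoding a given machine uses, the following sketch prints the default charset and checks that the no-argument getBytes() gives the same bytes as passing that default explicitly. The class name DefaultCharsetDemo and the helper noArgMatchesDefault are hypothetical names for illustration. Note that JDK 18 and later default to UTF-8 regardless of the operating system locale unless this is overridden, so the 12-versus-9 difference described here is mainly seen on older runtimes.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetDemo {
    // Hypothetical helper: true when getBytes() and getBytes(defaultCharset) agree.
    static boolean noArgMatchesDefault(String s) {
        byte[] implicit = s.getBytes();
        byte[] explicit = s.getBytes(Charset.defaultCharset());
        return Arrays.equals(implicit, explicit);
    }

    public static void main(String[] args) {
        // Chinese characters written as Unicode escapes
        String s = "Hello!\u4f60\u597d\uff01";

        // The values printed here vary from machine to machine.
        System.out.println("default charset:   " + Charset.defaultCharset());
        System.out.println("getBytes().length: " + s.getBytes().length);
        System.out.println("matches explicit default: " + noArgMatchesDefault(s));
    }
}
```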

Java encoding support

Java supports encodings for many languages. Inside Java, all characters are stored as Unicode. For example, the Unicode code point of "你" is 4f60, which we can verify with the following experiment:

class TestCharset
{

    public static void main(String[] args)
    {
        char c = '你';
        int i = c;   // widen the char to its numeric Unicode value
        System.out.println(c);
        System.out.println(i);
    }

}

No matter which platform you run the program on, the integer value printed is the same:

20320

20320 is the decimal value of the Unicode code point 4f60. In fact, if you decompile the class above, you will find that the character "你" (or any other Chinese string) in the generated .class file is stored in Unicode form:

char c = '\u4f60';
......
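The relationship between the character, the hexadecimal value 4f60, and the decimal value 20320 can also be checked directly. This is only a small sketch; the class name CodePointDemo and the helper hexOf are hypothetical names for illustration.

```java
class CodePointDemo {
    // Hypothetical helper: hexadecimal form of a char's numeric value.
    static String hexOf(char c) {
        return Integer.toHexString(c);
    }

    public static void main(String[] args) {
        char c = '\u4f60';                        // the character discussed above, as a Unicode escape
        System.out.println((int) c);              // prints 20320
        System.out.println(hexOf(c));             // prints 4f60
        System.out.println(c == (char) 0x4f60);   // prints true
    }
}
```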

Even if you specify the source encoding when compiling, for example:

javac -encoding GBK TestCharset.java

the .class file generated after compilation still stores Chinese characters or strings in Unicode form.

Use the String.getBytes(String charset) method

To avoid this problem, we recommend using the String.getBytes(String charset) method in your code. Next we extract byte arrays in ISO-8859-1 and GBK from the string to see what happens:

class TestCharset
{

    public static void main(String[] args)
    {
        new TestCharset().execute();
    }

    private void execute() {
        String s = "Hello!你好！";

        byte[] bytesISO8859 = null;
        byte[] bytesGBK = null;

        try
        {
            bytesISO8859 = s.getBytes("iso-8859-1");
            bytesGBK = s.getBytes("GBK");
        }
        catch (java.io.UnsupportedEncodingException e)
        {
            e.printStackTrace();
        }

        System.out.println("--------------\n8859 Bytes:");
        System.out.println("bytes is: " + arrayToString(bytesISO8859));
        System.out.println("hex format is: " + encodeHex(bytesISO8859));
        System.out.println();
        System.out.println("--------------\nGBK Bytes:");
        System.out.println("bytes is: " + arrayToString(bytesGBK));
        System.out.println("hex format is: " + encodeHex(bytesGBK));
    }

    public static final String encodeHex(byte[] bytes)
    {
        StringBuffer buff = new StringBuffer(bytes.length * 2);
        String b;
        for (int i = 0; i < bytes.length; i++)
        {
            b = Integer.toHexString(bytes[i]);
            // A byte is two hex digits, but Integer.toHexString widens a negative
            // byte to a 4-byte int (8 hex digits), so keep only the last two digits.
            buff.append(b.length() > 2 ? b.substring(6, 8) : b);
            buff.append(" ");
        }
        return buff.toString();
    }

    public static final String arrayToString(byte[] bytes)
    {
        StringBuffer buff = new StringBuffer();
        for (int i = 0; i < bytes.length; i++)
        {
            buff.append(bytes[i] + " ");
        }
        return buff.toString();
    }

}

Running the program above prints:

--------------
8859 Bytes:
bytes is: 72 101 108 108 111 33 63 63 63
hex format is: 48 65 6c 6c 6f 21 3f 3f 3f

--------------
GBK Bytes:
bytes is: 72 101 108 108 111 33 -60 -29 -70 -61 -93 -95
hex format is: 48 65 6c 6c 6f 21 c4 e3 ba c3 a3 a1

As you can see, the byte array extracted from s in ISO-8859-1 has length 9, and the Chinese characters have all become 63, which is the ASCII code for "?". When foreign programs are run in a Chinese environment, garbled characters often appear for exactly this reason: the encoding is not handled correctly.
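One way to notice this kind of loss before it happens is to ask the charset's encoder whether it can represent the string at all. The following is a minimal sketch rather than a definitive approach; the class name LossyEncodingDemo and the helpers fitsIn and toHex are hypothetical names for illustration. The hex helper masks each byte with 0xff, which is an alternative to trimming the output of Integer.toHexString.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {
    // Hypothetical helper: true when every character of s can be encoded in cs.
    static boolean fitsIn(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    // Hypothetical helper: space-separated hex dump.
    // Masking with 0xff turns a negative byte into its unsigned value (0..255),
    // and the format pads it to exactly two hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Chinese characters written as Unicode escapes
        String s = "Hello!\u4f60\u597d\uff01";
        Charset latin1 = StandardCharsets.ISO_8859_1;

        System.out.println(fitsIn(s, latin1));          // prints false
        byte[] bytes = s.getBytes(latin1);
        System.out.println(toHex(bytes));               // prints 48 65 6c 6c 6f 21 3f 3f 3f
        System.out.println(new String(bytes, latin1));  // prints Hello!???
    }
}
```

Once the characters have been replaced by "?", the original text cannot be recovered from the bytes, so the check has to happen before encoding.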

The byte array extracted as GBK correctly contains the GBK encoding of the Chinese characters. The GBK encodings of the characters "你", "好", and "！" are c4e3, bac3, and a3a1 respectively. Once you have the correct GBK-encoded byte array, you can restore it to a Chinese string with the following constructor:

new String(byte[] bytes, String charset)
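The round trip can be sketched as follows, using the overloads that take a Charset object instead of a charset name (these do not throw UnsupportedEncodingException). The class name RoundTripDemo and the helper roundTrip are hypothetical names for illustration, and the sketch assumes the GBK charset is available in the JDK in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Hypothetical helper: encode with one charset, then decode with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // Chinese characters written as Unicode escapes
        String s = "Hello!\u4f60\u597d\uff01";
        // Assumption: the GBK charset is available in this JDK.
        Charset gbk = Charset.forName("GBK");

        // Decoding with the same charset used for encoding restores the string.
        System.out.println(roundTrip(s, gbk, gbk).equals(s));                          // prints true

        // Decoding GBK bytes as ISO-8859-1 yields 12 unrelated characters instead.
        System.out.println(roundTrip(s, gbk, StandardCharsets.ISO_8859_1).equals(s));  // prints false
        System.out.println(roundTrip(s, gbk, StandardCharsets.ISO_8859_1).length());   // prints 12
    }
}
```

The string is restored only when the charset used for decoding matches the one used for encoding.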

 

Address: http://hi.baidu.com/gengshenspirit/blog/item/0398810f896047e8ab6457bc.html

