http://maimode.iteye.com/blog/1341354
I have long been confused about how many bytes a Chinese character occupies in Java, and I dodged the question every time it came up. Today a byte-encoding problem forced me to rethink char and encoding.
Here is some material from a discussion, for reference:
http://www.iteye.com/topic/47740 writes (the original author mixed up byte and bit, so I corrected that when quoting):
This looks like a simple question (and perhaps it really is), but it has left me confused about something I thought I understood.
Here is the strange part:
A char in Java should be 16 bits.
A byte in Java should be 8 bits.
char x = '编'; // This is legal, and the value is 16 bits
But
String str = "编";
byte[] bytes = str.getBytes(); // I don't understand why this gives 3 bytes
3 bytes is 3 * 8 = 24 bits altogether, which would not fit in char x. I believe char is 16 bits,
so what exactly is str.getBytes() doing?
Sorry if this is a bit muddled, but it really is strange. I hope someone can enlighten me.

Skydream wrote:
First, a char in Java really is 2 bytes. Java uses Unicode, with 2 bytes representing one character.
Second, the byte[] bytes = str.getBytes() that the original poster mentions does yield 3 bytes, but that is a different concept from the one above. Java represents characters internally in Unicode, and this Chinese character takes 2 bytes in that representation. The String.getBytes(encoding) method returns the byte array for the string in the specified encoding: a Chinese character is usually 2 bytes in GBK/GB2312 and 3 bytes in UTF-8. If no encoding is specified, the system default encoding is used.
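
The point above can be checked with a small sketch. The class name EncodedLengthDemo and the helper byteLength are hypothetical names chosen for illustration, and the sample character 编 (U+7F16) is an assumption; any common Chinese character should behave the same way. GBK is not guaranteed to be available on every JVM, so the sketch checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    // Hypothetical helper: number of bytes the string occupies in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String str = "\u7F16"; // assumed sample: the Chinese character U+7F16

        // One char (16 bits) in memory, regardless of any external encoding.
        System.out.println("chars:    " + str.length());                               // 1
        System.out.println("UTF-8:    " + byteLength(str, StandardCharsets.UTF_8));    // 3
        System.out.println("UTF-16BE: " + byteLength(str, StandardCharsets.UTF_16BE)); // 2

        // GBK is an optional charset, so only use it when the JVM provides it.
        if (Charset.isSupported("GBK")) {
            System.out.println("GBK:      " + byteLength(str, Charset.forName("GBK"))); // 2
        }
    }
}
```

Note that StandardCharsets.UTF_16 (without the BE/LE suffix) writes a 2-byte byte-order mark when encoding, so the same string comes out as 4 bytes there.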
kdekid writes:
First, understand the difference between a code point and an encoding. Java follows the Unicode 4.0 standard and uses UTF-16 as its internal character encoding. Unicode 4.0 covers the Basic Multilingual Plane, U+0000 to U+FFFF, and the supplementary planes, U+10000 to U+10FFFF; these values are code points. The Java char type is 16 bits, so a single char can only represent a character within the Basic Multilingual Plane, and a character in the supplementary planes is represented by a pair of chars.
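
To make the distinction between chars and code points concrete, here is a minimal sketch. The class name CodePointDemo and its two helper methods are hypothetical; the sample characters U+7F16 (inside the Basic Multilingual Plane) and U+20000 (in a supplementary plane) are assumptions picked for illustration.

```java
class CodePointDemo {
    // Number of 16-bit char units (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u7F16";                                         // U+7F16, fits in one char
        String supplementary = new String(Character.toChars(0x20000)); // U+20000, needs two chars

        System.out.println(charCount(bmp) + " char, " + codePointCount(bmp) + " code point");
        // 1 char, 1 code point
        System.out.println(charCount(supplementary) + " chars, "
                + codePointCount(supplementary) + " code point");
        // 2 chars, 1 code point

        // The two chars form a surrogate pair: a high surrogate followed by a low surrogate.
        char high = supplementary.charAt(0);
        char low = supplementary.charAt(1);
        System.out.println(Character.isHighSurrogate(high) && Character.isLowSurrogate(low)); // true
        System.out.println(Integer.toHexString(supplementary.codePointAt(0)));                // 20000
    }
}
```

This is why String.length() can overstate the number of characters a reader would see: it counts char units, not code points.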
String.getBytes(), in turn, returns the bytes of the string in a given encoding. The default encoding on a typical Chinese system is UTF-8 (Linux, Mac) or GBK/GB18030 (Windows). For text within the Basic Multilingual Plane, a Chinese character is 3 bytes in UTF-8 and 2 bytes in GBK/GB18030.
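
Because the no-argument getBytes() depends on the default charset, its result can differ between machines. The sketch below prints the default and contrasts it with an explicit UTF-8 request. The class name DefaultCharsetDemo and the helper utf8Length are hypothetical, and the sample characters are assumptions. As far as I know, JDK 18 and later default to UTF-8 on all platforms (JEP 400), so the platform differences described above mainly apply to older JDKs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Hypothetical helper: byte length in UTF-8, independent of the platform default.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String bmp = "\u7F16";                                         // assumed sample, U+7F16
        String supplementary = new String(Character.toChars(0x20000)); // assumed sample, U+20000

        // Platform dependent: the result of getBytes() follows this charset.
        System.out.println("default charset:   " + Charset.defaultCharset());
        System.out.println("getBytes() length: " + bmp.getBytes().length);

        // Explicit charset: the same result everywhere.
        System.out.println("UTF-8, BMP character:           " + utf8Length(bmp));           // 3
        System.out.println("UTF-8, supplementary character: " + utf8Length(supplementary)); // 4

        // GB18030 is optional on some JVMs, so check before using it.
        if (Charset.isSupported("GB18030")) {
            Charset gb = Charset.forName("GB18030");
            System.out.println("GB18030, BMP character:           " + bmp.getBytes(gb).length);
            System.out.println("GB18030, supplementary character: " + supplementary.getBytes(gb).length);
        }
    }
}
```

Passing the charset explicitly, as in utf8Length, is the usual way to keep byte counts from changing between operating systems.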