Number of bytes in Java char and Chinese characters

Source: Internet
Author: User

http://maimode.iteye.com/blog/1341354


I have long been confused about how many bytes a Chinese character occupies in Java, and every time the question came up I dodged it. Today a byte-encoding problem forced me to rethink char and encoding.

Here is some material from a discussion on the topic:

http://www.iteye.com/topic/47740 writes (the original author mixed up byte and bit, so I corrected that when quoting):

This looks like a simple question (perhaps it really is simple), but it has left me confused about something I thought I understood.
Strange.

A char in Java should be 16 bits.
A byte in Java should be 8 bits.
char x = '编'; // this is legal, and x is 16 bits

But
String str = "编";
byte[] bytes = str.getBytes(); // I don't understand why this yields 3 bytes
Three bytes is 3 * 8 = 24 bits, so how can char x hold that? I believe char is 16 bits,
but then what exactly is str.getBytes() doing?

Sorry if this is a bit muddled, but it really is strange. I hope someone can enlighten me.

Skydream wrote: First, a char in Java really is 2 bytes. Java uses Unicode and represents a character with 2 bytes.

Second, the byte[] bytes = str.getBytes() call that the original poster mentions, which yields 3 bytes, is a different concept from the above. Java represents characters internally in Unicode, and this Chinese character takes 2 bytes in that form. The String.getBytes(encoding) method returns the byte array for the string in the specified encoding; a Chinese character is usually 2 bytes in GBK/GB2312 and 3 bytes in UTF-8. If no encoding is specified, the system default encoding is used.
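To make Skydream's point concrete, here is a minimal sketch that encodes one Chinese character in several encodings and counts the bytes. The class name EncodedLengthDemo, the helper encodedLength, and the sample character 中 (U+4E2D) are assumptions chosen for illustration; they are not from the original thread. GBK is checked with Charset.isSupported because some trimmed-down runtimes may not ship it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLengthDemo {
    // Hypothetical helper: number of bytes the string occupies in the given encoding.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String str = "\u4E2D"; // assumed sample: the Chinese character U+4E2D

        System.out.println("chars:    " + str.length());                                 // 1 char (16 bits)
        System.out.println("UTF-8:    " + encodedLength(str, StandardCharsets.UTF_8));    // 3 bytes
        System.out.println("UTF-16BE: " + encodedLength(str, StandardCharsets.UTF_16BE)); // 2 bytes

        // GBK is an extended charset, so guard the lookup.
        if (Charset.isSupported("GBK")) {
            System.out.println("GBK:      " + encodedLength(str, Charset.forName("GBK"))); // 2 bytes
        }
    }
}
```

The string always holds a single 16-bit char; only the encoded byte array changes size with the encoding passed to getBytes.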

kdekid writes: First, understand the difference between a code point and an encoding. Java follows the Unicode 4.0 standard and uses UTF-16 as its internal character encoding. Unicode 4.0 covers the Basic Multilingual Plane (U+0000 to U+FFFF) and the supplementary planes (U+10000 to U+10FFFF); these values are code points. The Java char type is 16 bits, so a single char can only represent a character within the Basic Multilingual Plane, and a character in the supplementary planes is represented by a pair of chars.
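The distinction between code points and chars can be sketched as follows. The class name CodePointDemo, the helper charsFor, and the two sample code points (U+4E2D inside the Basic Multilingual Plane, U+20000 outside it) are assumptions picked for illustration.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    // Hypothetical helper: how many Java chars are needed to hold one code point.
    static int charsFor(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    public static void main(String[] args) {
        String bmp = new String(Character.toChars(0x4E2D));            // inside the BMP
        String supplementary = new String(Character.toChars(0x20000)); // outside the BMP

        System.out.println(bmp.length());            // 1 char
        System.out.println(supplementary.length());  // 2 chars (a surrogate pair)

        // Both strings still contain exactly one code point.
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1

        // The two chars are a high surrogate followed by a low surrogate.
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1)));  // true

        // In UTF-8 a supplementary character takes 4 bytes, not 3.
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

This is why String.length() counts chars rather than characters: for text outside the Basic Multilingual Plane, codePointCount gives the number of code points.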

The String.getBytes() method, in turn, returns the bytes of the string in a given encoding. On a typical Chinese-language system the default encoding is UTF-8 (Linux, Mac) or GBK/GB18030 (Windows). For characters within the Basic Multilingual Plane, a Chinese character is 3 bytes in UTF-8 and 2 bytes in GBK/GB18030.
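Since the result of the no-argument getBytes() depends on the platform, it can help to print the default charset next to the byte count. A small sketch, assuming the hypothetical class name DefaultCharsetDemo and the sample character U+4E2D. Note that on JDK 18 and later the default charset is UTF-8 on every platform, so the Windows behaviour described above applies to older releases.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Hypothetical helper: byte count using the platform default encoding.
    static int defaultEncodedLength(String s) {
        return s.getBytes().length;
    }

    public static void main(String[] args) {
        String str = "\u4E2D"; // assumed sample Chinese character

        // The charset that getBytes() falls back to when none is given.
        System.out.println("default charset: " + Charset.defaultCharset());

        // This count varies by platform and JDK version.
        System.out.println("default bytes:   " + defaultEncodedLength(str));

        // Naming the charset explicitly gives the same answer everywhere.
        System.out.println("UTF-8 bytes:     " + str.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Passing an explicit Charset to getBytes avoids surprises when the same code runs on machines with different defaults.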
