I was recently reading Core Java and could not make sense of the difference between code points and code units. After going through many posts and blogs, I finally understood it.
In Java, char[], String, StringBuilder, and StringBuffer all use UTF-16 encoding. The range U+0000 to U+FFFF holds the basic characters (the Basic Multilingual Plane, or BMP), but the char values from U+D800 to U+DBFF (high surrogates) and from U+DC00 to U+DFFF (low surrogates) are reserved and do not represent characters on their own. Most common Unicode characters fit in a single code unit, while supplementary characters need a pair of code units. In other words, a BMP character is stored in one char, and a supplementary character is stored in a pair of char values.
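As a rough illustration of the paragraph above, the sketch below checks how many char values a code point needs and whether a given char is a surrogate. The class name SurrogateDemo and the helper unitsNeeded are made up for this example; the Character methods are standard (isSurrogate assumes Java 7 or later).

```java
// Hypothetical demo class; the names here are illustrative only.
final class SurrogateDemo {

    // Number of char values (UTF-16 code units) needed to store one code point.
    static int unitsNeeded(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        // 'A' (U+0041) lies in the BMP, so one char is enough.
        System.out.println(unitsNeeded(0x0041));   // 1
        // U+10400 is a supplementary character, so it needs a surrogate pair.
        System.out.println(unitsNeeded(0x10400));  // 2

        char[] pair = Character.toChars(0x10400);
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true
        // An ordinary BMP character is not a surrogate.
        System.out.println(Character.isSurrogate('A'));         // false
    }
}
```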
Java uses the term code point for a character value (an int) in the range U+0000 to U+10FFFF, and the term Unicode code unit for a 16-bit char value that is one code unit of the UTF-16 encoding. So a single character can have a code point count of 1 and a code unit count of 2. The number of code units is therefore not necessarily the number of characters.
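To make the relationship between one code point and two code units concrete, here is a small sketch that rebuilds a code point from a surrogate pair by hand and compares it with Character.toCodePoint. The class name SurrogateMath and the helper combine are assumptions for illustration; the arithmetic is the usual UTF-16 formula.

```java
// Hypothetical demo class; the names here are illustrative only.
final class SurrogateMath {

    // Combine a surrogate pair by hand using the UTF-16 arithmetic:
    // the high surrogate carries the upper 10 bits, the low surrogate the lower 10 bits.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = '\uD801';
        char low = '\uDC00';
        int manual = combine(high, low);
        int library = Character.toCodePoint(high, low);
        System.out.printf("manual:  U+%X%n", manual);    // U+10400
        System.out.printf("library: U+%X%n", library);   // U+10400
        // charCount reports how many code units this code point occupies.
        System.out.println(Character.charCount(manual)); // 2
    }
}
```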
Of the two, the code unit is the lower-level concept.
Related methods:
length() returns the number of UTF-16 code units in the string.
codePointCount(beginIndex, endIndex) returns the number of code points in the given range of the string.
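A minimal sketch of how the two counts differ on a mixed string. The class name LengthDemo and the helper codePoints are invented for this example; length(), codePointCount(), and offsetByCodePoints() are standard String methods.

```java
// Hypothetical demo class; the names here are illustrative only.
final class LengthDemo {

    // Count code points over the whole string instead of char values.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" + U+10400 + "b": three code points stored in four code units.
        String s = "a" + new String(Character.toChars(0x10400)) + "b";
        System.out.println(s.length());     // 4
        System.out.println(codePoints(s));  // 3
        // char index at which the code point with index 2 ("b") starts.
        System.out.println(s.offsetByCodePoints(0, 2)); // 3
    }
}
```

offsetByCodePoints is handy when a position is known in code points but an API expects a char index.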
Test program:
package com.xujin;

public class Test {
    public static void main(String... args) {
        char[] ch = Character.toChars(0x10400);
        System.out.printf("U+10400 high surrogate: %04x\n", (int) ch[0]); // d801
        System.out.printf("U+10400 low surrogate: %04x\n", (int) ch[1]);  // dc00

        String str = new String(ch);
        System.out.println("Number of code units: " + str.length()); // 2
        System.out.println("Number of code points: " + str.codePointCount(0, str.length())); // 1
        // Code point that starts at the given index: 66560
        System.out.println(str.codePointAt(0));
        // Code unit at the given index. A lone low surrogate is not a
        // character on its own, so it usually prints as ?
        System.out.println(str.charAt(1));

        // Traverse the string and print the code point of every character.
        str += "Hello, world!";
        int i = 0;
        while (i < str.length()) {
            int cp = str.codePointAt(i); // read the code point at the current index on every pass
            System.out.println(cp);
            if (Character.isSupplementaryCodePoint(cp)) {
                i += 2; // supplementary character: skip both surrogates
            } else {
                i++;    // BMP character: a single code unit
            }
        }
        /*
         * Output, one value per line:
         * 66560 72 101 108 108 111 44 32 119 111 114 108 100 33
         */
    }
}
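The same kind of traversal can also take its step size from Character.charCount, or use the codePoints() stream. The sketch below is one possible way to write it, assuming Java 8 or later for the stream call; the class name CodePointWalk and the method walk are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names here are illustrative only.
final class CodePointWalk {

    // Collect every code point, advancing by the number of code units each one occupies.
    static List<Integer> walk(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10400)) + "Hi";
        System.out.println(walk(s)); // [66560, 72, 105]
        // Same traversal with the stream API (Java 8 or later).
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Letting charCount decide the step keeps the loop correct without hard-coding the surrogate check.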
References:
http://bbs.csdn.net/topics/340195349
http://blog.csdn.net/longyulu/article/details/7374862