Knowledge Points:
Calling byte[] bytes = "xxxx".getBytes("UTF-8") encodes the string into a byte array using UTF-8. In UTF-8, characters in the ASCII range are stored in 1 byte, while commonly used Chinese characters are stored in 3 bytes.
UTF-8 is a variable-length encoding. If a character is encoded in a single byte, the highest bit of that byte is 0. If it is encoded in multiple bytes, the number of consecutive 1 bits at the start of the first byte gives the total number of bytes in the sequence, and each of the remaining bytes begins with 10. The original UTF-8 design allowed sequences of up to 6 bytes; the current standard (RFC 3629) limits them to 4 bytes, so the 5- and 6-byte forms are of historical interest only.
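As a quick check of the byte counts described above, here is a minimal sketch. The class name Utf8ByteCount and the helper utf8Length are made up for illustration, and the sample characters are my own choice. It uses StandardCharsets.UTF_8 rather than the "UTF-8" string, which avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteCount {
    // Hypothetical helper: number of bytes the UTF-8 encoding of s occupies.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // ASCII letter: 1
        System.out.println(utf8Length("\u4e2d"));       // Chinese character U+4E2D: 3
        System.out.println(utf8Length("a\u4e2d"));      // mixed: 1 + 3 = 4
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, outside the BMP: 4
    }
}
```

The last line shows that not every non-ASCII character takes 3 bytes: characters outside the Basic Multilingual Plane take 4.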
The byte patterns are shown in the following table:
1 byte 0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
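The table can be turned into code by masking the leading bits of the first byte. The following is only a sketch: the class Utf8LeadByte and the method sequenceLength are hypothetical names, and it only matches the bit patterns for 1 to 4 bytes. It does not perform full validation, such as rejecting overlong encodings.

```java
public class Utf8LeadByte {
    /**
     * Returns the sequence length implied by the bit pattern of a lead byte
     * (1 to 4), or -1 for a continuation byte (10xxxxxx) or any pattern
     * longer than 4 bytes.
     */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;              // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 'A': 1
        System.out.println(sequenceLength((byte) 0xE8)); // 11101000: 3
        System.out.println(sequenceLength((byte) 0xBF)); // 10111111, continuation: -1
    }
}
```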
Note: Java's byte type is signed, but the highest bit of every byte in a multibyte UTF-8 sequence is 1 and is not meant as a sign bit. Printing such bytes directly therefore produces negative numbers.
byte[] bss = "这是一个神奇的世界".getBytes("UTF-8"); // "This is a magical world"
System.out.println("bss length:" + bss.length);
// output: 27, each Chinese character is stored in three bytes.
for (byte b : bss) {
    System.out.print(b + " ");
}
// output: -24 -65 -103 -26 -104 -81 -28 -72 -128 -28 -72 -86 -25 -91 -98 -27 -91 -121 -25 -102 -124 -28 -72 -106 -25 -107 -116
To obtain the actual unsigned value of each byte, mask it with 0xFF before printing, as in the following three variants. (This requires some understanding of bitwise operations and of sign-magnitude, ones' complement, and two's complement representation.)
1. Decimal
byte[] bss = "这是一个神奇的世界".getBytes("UTF-8"); // "This is a magical world"
System.out.println("bss length:" + bss.length);
// output: 27, each Chinese character is stored in three bytes.
for (byte b : bss) {
    System.out.print(Integer.valueOf(b & 0xff) + " ");
}
2. Hexadecimal
byte[] bss = "这是一个神奇的世界".getBytes("UTF-8"); // "This is a magical world"
System.out.println("bss length:" + bss.length);
// output: 27, each Chinese character is stored in three bytes.
for (byte b : bss) {
    System.out.print(Integer.toHexString(b & 0xFF) + " ");
}
3. Binary
byte[] bss = "这是一个神奇的世界".getBytes("UTF-8"); // "This is a magical world"
System.out.println("bss length:" + bss.length);
// output: 27, each Chinese character is stored in three bytes.
for (byte b : bss) {
    System.out.print(Integer.toBinaryString(b & 0xFF) + " ");
}
// output:
// 11101000 10111111 10011001 11100110 10011000 10101111 11100100
// 10111000 10000000 11100100 10111000 10101010 11100111 10100101 10011110
// 11100101 10100101 10000111 11100111 10011010 10000100 11100100
// 10111000 10010110 11100111 10010101 10001100
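One detail the binary and hexadecimal variants above do not cover: Integer.toBinaryString and Integer.toHexString drop leading zeros, so an ASCII byte such as 'A' prints as 1000001 rather than 01000001. A possible fixed-width version is sketched below. The class ByteFormats and its two helper methods are hypothetical names introduced for this example.

```java
public class ByteFormats {
    // Always returns 8 binary digits: setting bit 8 forces a 9-digit string,
    // and substring(1) then drops that extra leading 1.
    static String toBinary8(byte b) {
        return Integer.toBinaryString((b & 0xFF) | 0x100).substring(1);
    }

    // Always returns 2 lowercase hexadecimal digits.
    static String toHex2(byte b) {
        return String.format("%02x", b & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(toBinary8((byte) 0x41)); // 01000001
        System.out.println(toBinary8((byte) 0xE8)); // 11101000
        System.out.println(toHex2((byte) 0x0A));    // 0a
    }
}
```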
Exercise: Truncating a mixed Chinese and English string
import java.io.UnsupportedEncodingException;
import java.util.Scanner;

/**
 * Given a string and a byte count, truncates the string to that number of bytes.
 * In UTF-8, non-English characters occupy multiple bytes, so if the cut position
 * falls in the middle of such a character, the partial character is discarded.
 */
public class StrTruncate {

    public static void main(String[] args) throws UnsupportedEncodingException {
        Scanner scanner = new Scanner(System.in);
        System.out.println("Input (string,number of bytes)");
        String inputStr = scanner.nextLine();
        String[] parts = inputStr.split(",");
        String sub = new StrTruncate().getSubStr(parts[0], Integer.valueOf(parts[1].trim()));
        System.out.println("The truncated string is: " + sub);
    }

    public String getSubStr(String resource, int charLen) throws UnsupportedEncodingException {
        if (charLen <= 0) {
            return null;
        }
        byte[] bytes = resource.getBytes("UTF-8");
        // Nothing to cut if the requested length covers the whole string;
        // this also keeps bytes[charLen] below from going out of bounds.
        if (charLen >= bytes.length) {
            return resource;
        }
        // A negative byte belongs to a multibyte character. Step back over
        // continuation bytes (10xxxxxx) until the lead byte (11xxxxxx) is reached,
        // so that the whole partial character is excluded.
        if (bytes[charLen] < 0) {
            while (!Integer.toBinaryString(bytes[charLen] & 0xff).startsWith("11")) {
                charLen--;
            }
        }
        return new String(bytes, 0, charLen, "UTF-8");
    }
}
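The same idea can be written without converting each byte to a binary string, by testing the top two bits with a mask. The following is a sketch under my own assumptions, not the original author's code: the class Utf8Truncate and the method truncate are hypothetical names, and unlike the exercise above it returns an empty string rather than null for a non-positive byte count.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncate {
    // Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes
    // without splitting a character.
    static String truncate(String s, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (maxBytes >= bytes.length) {
            return s;
        }
        int end = maxBytes;
        // bytes[end] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut is inside a character, so
        // move back to that character's lead byte and drop it as well.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "ab\u4e2dc"; // bytes: 'a', 'b', E4 B8 AD, 'c'
        System.out.println(truncate(s, 3)); // ab (cut falls inside U+4E2D)
        System.out.println(truncate(s, 5)); // prints "ab", U+4E2D, no trailing c
    }
}
```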
The storage method for UTF-8 format strings in Java.