Character encoding is the binary representation of text, control characters, and common symbols. To specify precisely how characters are numbered and how they become byte streams, Unicode Technical Report (UTR) #17 defines the modern encoding model in five levels: the abstract character repertoire (ACR), the coded character set (CCS), the character encoding form (CEF), the character encoding scheme (CES), and the transfer encoding syntax (TES).
ASCII was first published in 1963 and defines 128 characters: 95 printable characters and 33 control characters (such as the line feed). Standard ASCII stores each character in the low 7 bits of a byte, with the most significant bit set to 0. In everyday use it is enough to remember a few anchor values: '0' is 48, 'A' is 65, 'a' is 97, and upper and lower case differ by 32. After standard ASCII, IBM extended the code for its own systems, and the International Organization for Standardization later developed the ISO 2022 standard, which specifies a unified method for extending the ASCII character set to 8-bit codes while remaining compatible with ISO 646. On that basis, ISO defined different extended ASCII codes for different regions.
To meet their own needs, countries began to define encodings for their own scripts, most of which use two bytes to represent a character. These encodings are loosely and collectively referred to as ANSI encodings; well-known examples include GB2312 (a Simplified Chinese character table), BIG5 (Traditional Chinese), and GBK (an extension of GB2312). The various ANSI encodings are incompatible with one another, because the same byte sequence maps to different characters under different ANSI encodings. Against this background, Unicode was created, with the goal of incorporating all known characters and symbols into a single code space.
Unicode maps characters to the range 0 to 0x10FFFF, a code space of 1,114,112 (2^20 + 2^16) code points. The code space is divided into 17 planes of 65,536 code points each, where the plane index is given by the bits above the low 16. Planes 15 and 16 are reserved as private use areas, and plane 0 (the Basic Multilingual Plane, BMP) contains a private use area of its own, U+E000 to U+F8FF, with 6,400 code points. The BMP range U+D800 to U+DFFF, 2,048 code points in total, is known as the surrogate area and is never assigned to characters. Storing raw code points directly is inefficient, so several encoding forms exist for Unicode, the best known being UTF-8, UTF-16, and UTF-32; UTF-8 and UTF-16 are introduced below.
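As a quick illustration, the plane of a code point is just the part of its value above the low 16 bits. A minimal Java sketch (the variable names are illustrative, not from any API):

    int cp = 0x1F600;                                    // a code point in plane 1
    int plane = cp >>> 16;                               // 1: the Supplementary Multilingual Plane
    boolean isSurrogate = cp >= 0xD800 && cp <= 0xDFFF;  // false: not in the surrogate area
    boolean isPrivateUse = cp >= 0xE000 && cp <= 0xF8FF; // false: not in the BMP private use area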
UTF-8 is a byte-oriented encoding form for Unicode; it encodes different ranges of code points with byte sequences of different lengths:
Unicode range   | Byte sequence
000000 ~ 00007F | 0xxxxxxx
000080 ~ 0007FF | 110xxxxx 10xxxxxx
000800 ~ 00FFFF | 1110xxxx 10xxxxxx 10xxxxxx
010000 ~ 10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The byte prefixes make the encoding self-describing: when parsing, the prefix of each byte tells the decoder whether the byte starts a character and, if so, how many bytes the character occupies.
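To make the table concrete, here is a minimal hand-rolled encoder sketch (the helper name encodeUtf8 is hypothetical; in practice String.getBytes(StandardCharsets.UTF_8) does this for you):

    static byte[] encodeUtf8(int cp) {
        if (cp < 0x80) {            // 1 byte: 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

For example, encodeUtf8(0x4E2D) (the character 中) yields E4 B8 AD.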
UTF-16 uses a 16-bit unsigned integer as its code unit. The range U+0000 to U+FFFF contains the most commonly used characters, and UTF-16 encodes each code point in this range directly as a single 16-bit code unit; code points from U+10000 to U+10FFFF are encoded as a pair of 16-bit code units (a surrogate pair). Code illustrating the encoding method follows:
    /**
     * UTF-16 encoding algorithm, big-endian.
     * @param offset the code point to encode
     * @return the UTF-16 code units in hex
     */
    public String convertUTF16(int offset) {
        if (offset < 0x10000) {
            // BMP code point: a single 16-bit code unit
            int vh = (offset & (0xFF << 8)) >> 8; // high 8 bits
            int vl = offset & 0xFF;               // low 8 bits
            return Integer.toHexString(vh) + " " + Integer.toHexString(vl);
        }
        // Supplementary code point: subtract 0x10000 and split the 20-bit result
        int val = offset - 0x10000;
        int vh = (val & (0x3FF << 10)) >> 10;     // high 10 bits
        int vl = val & 0x3FF;                     // low 10 bits
        int ph = 0xD800;                          // high (lead) surrogate base
        int pl = 0xDC00;                          // low (trail) surrogate base
        String firstSymbol = Integer.toHexString(ph + vh);
        String secSymbol = Integer.toHexString(pl + vl);
        return firstSymbol + " " + secSymbol;
    }
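For example, convertUTF16(0x1D11E) (MUSICAL SYMBOL G CLEF, a supplementary character) returns "d834 dd1e", the surrogate pair U+D834 U+DD1E.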
For encodings whose code unit is larger than one byte, there is also the problem of byte order (big-endian vs. little-endian). This is resolved with a BOM (byte order mark) transmitted at the start of the stream. The BOMs for the Unicode encoding schemes are listed below:
UTF encoding scheme | BOM
UTF-8     | EF BB BF
UTF-16 LE | FF FE
UTF-16 BE | FE FF
UTF-32 LE | FF FE 00 00
UTF-32 BE | 00 00 FE FF
Note: because UTF-8's code unit is a single byte, it has no byte-order ambiguity; its BOM does not indicate byte order but merely marks the stream as UTF-8, and it is optional.
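Based on the table above, a minimal detection sketch (the helper name detectBom is hypothetical). Note that UTF-32 LE must be tested before UTF-16 LE, because FF FE is a prefix of FF FE 00 00:

    static String detectBom(byte[] b) {
        if (b.length >= 4 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE
                && b[2] == 0x00 && b[3] == 0x00) return "UTF-32 LE";
        if (b.length >= 4 && b[0] == 0x00 && b[1] == 0x00
                && b[2] == (byte) 0xFE && b[3] == (byte) 0xFF) return "UTF-32 BE";
        if (b.length >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB
                && b[2] == (byte) 0xBF) return "UTF-8";
        if (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) return "UTF-16 LE";
        if (b.length >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) return "UTF-16 BE";
        return "no BOM";
    }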
How Java handles character encoding is best explained through a few questions:
1. What is Java's byte order: big-endian or little-endian?
Java is big-endian, which can be verified with a small program:
    /**
     * Check byte order.
     * @throws UnsupportedEncodingException
     */
    public String checkByteOrder() throws UnsupportedEncodingException {
        String a = "a";
        byte[] arr = a.getBytes("UTF-16"); // the first two bytes are the BOM
        if (arr[2] == 0) {
            return "Big-endian";
        } else if (arr[2] == 97) {
            return "Little-endian";
        } else {
            return "Error";
        }
    }
2. Which character encoding does Java use internally?
In the Java platform, the char[], String, StringBuilder, and StringBuffer types all use UTF-16: a BMP character is represented by a single char, and a supplementary character by a pair of chars (a surrogate pair).
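A minimal sketch that makes the two-level view visible, using only standard Character and String methods (U+1F600 is a supplementary character outside the BMP):

    String s = new String(Character.toChars(0x1F600));
    System.out.println(s.length());                             // 2: two char code units
    System.out.println(s.codePointCount(0, s.length()));        // 1: one code point
    System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true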
3. Which encoding does the String.getBytes() method use when converting to bytes?
As the API documentation shows, the no-argument String.getBytes() uses the platform's default encoding. The platform default can be obtained as follows:
    Charset.defaultCharset()
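Because the default varies across platforms, passing an explicit charset is the safer habit. A minimal sketch (the class name DefaultCharsetDemo is hypothetical):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class DefaultCharsetDemo {
        public static void main(String[] args) {
            System.out.println(Charset.defaultCharset());               // e.g. UTF-8 on most modern JVMs
            byte[] platformBytes = "中文".getBytes();                    // platform-dependent result
            byte[] utf8Bytes = "中文".getBytes(StandardCharsets.UTF_8); // always 6 bytes for these two characters
            System.out.println(platformBytes.length + " vs " + utf8Bytes.length);
        }
    }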
Finally, the StringBufferInputStream class should not be used: it is an early Java class, deprecated since JDK 1.1, whose read() method keeps only the low 8 bits of each char, which can cause many subtle bugs:
    public int read() {
        return (pos < count) ? (buffer.charAt(pos++) & 0xFF) : -1;
    }
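A minimal sketch of the replacement the JDK itself recommends, StringReader, which hands back full 16-bit chars instead of truncating them:

    import java.io.Reader;
    import java.io.StringReader;

    Reader reader = new StringReader("中文");
    int c = reader.read(); // 0x4E2D: the full char, not just its low 8 bits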