Introduction to character encoding and application in Java

Source: Internet
Author: User
Tags control characters

Character encoding is the binary representation of control characters, text, and common symbols. To describe precisely how characters are numbered and how they are turned into a stream of 8-bit bytes, Unicode Technical Report (UTR) #17 proposes a modern encoding model with five levels:

1. Abstract Character Repertoire: the set of all abstract characters that the system supports.

2. Coded Character Set: maps each abstract character, by some set of rules, to a code point in the code space.

3. Character Encoding Form: converts code points into sequences of fixed-width integer values (code units); UTF-8, UTF-16, and so on belong to this layer.

4. Character Encoding Scheme: converts the fixed-width integers into a sequence of 8-bit bytes.

5. Transfer Encoding Syntax: further processes the byte stream to meet the needs of transmission; Base64 belongs to this layer.

Character encoding has evolved along with computer science and information science: from the original standard ASCII, to extended ASCII, to the national encodings of individual countries, and finally to today's unified Unicode encoding.
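To make the five levels of the model concrete, the following minimal sketch follows one character through levels 2 to 5 using standard JDK classes. The class name `EncodingLayers` and its method names are illustrative only and do not come from the original article.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class: traces one character through the encoding levels.
class EncodingLayers {
    // Level 2: the abstract character is identified by a code point.
    static int codePoint(String s) {
        return s.codePointAt(0);
    }

    // Level 3: UTF-16 expresses the code point as one or two 16-bit code units.
    static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Level 4: UTF-16BE serializes each code unit as two bytes, high byte first.
    static byte[] utf16beBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    // Level 5: Base64 turns the byte stream into text that is safe to transmit.
    static String base64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        String s = "A";
        int cp = codePoint(s);
        byte[] bytes = utf16beBytes(s);
        System.out.printf("code point U+%04X, %d code unit(s), %d byte(s), Base64 %s%n",
                cp, utf16Units(cp).length, bytes.length, base64(bytes));
    }
}
```

Level 1 (the repertoire itself) has no code counterpart here, since it is just the set of characters the system agrees to support.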

The ASCII standard was first published in 1963 and contains 128 characters: 95 printable characters and 33 control characters (such as line feed). Standard ASCII uses 7 bits of a byte to store the 128 characters, and the highest bit is 0. In everyday coding it is enough to remember a few typical values: '0' is 48, 'A' is 65, 'a' is 97, and an uppercase letter and its lowercase form differ by 32. After standard ASCII, IBM extended the code for its own systems. The International Organization for Standardization then developed the ISO 2022 standard, which specifies a unified method for extending the ASCII character set to 8-bit codes while remaining compatible with ISO 646. ISO went on to define different extended ASCII codes for different regions.
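The typical ASCII values mentioned above can be checked directly in Java, because a `char` converts to its numeric code. The sketch below is illustrative; the class name `AsciiBasics` and the helper `toggleCase` are assumptions, not part of any library.

```java
// Hypothetical helper class for inspecting ASCII values.
class AsciiBasics {
    // Upper- and lowercase ASCII letters differ by 32, which is the single bit 0x20.
    // Non-letters are returned unchanged.
    static char toggleCase(char c) {
        boolean isAsciiLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        return isAsciiLetter ? (char) (c ^ 0x20) : c;
    }

    public static void main(String[] args) {
        System.out.println((int) '0');       // numeric code of the digit zero
        System.out.println((int) 'A');       // numeric code of uppercase A
        System.out.println((int) 'a');       // numeric code of lowercase a
        System.out.println(toggleCase('A')); // flips the case of an ASCII letter
    }
}
```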

To meet their own needs, individual countries began to develop encodings for their own scripts, most of which use two bytes to represent a character. These encodings are collectively called ANSI encodings; the better-known ones include GB2312 (the Simplified Chinese character table), Big5 (a Traditional Chinese encoding), and GBK (an extension of GB2312). The many ANSI encodings are incompatible with one another, because the same byte values represent different characters in different ANSI encodings. Against this background Unicode was created, based on the idea of bringing all known scripts and symbols into a single character set.

Unicode maps these characters to code points from 0 to 0x10FFFF, which gives 1,114,112 (2^20 + 2^16) code points. The bits above the low 16 bits of a code point select its plane, and there are currently 17 planes of 65,536 code points each. Planes 15 and 16 are reserved as private use areas, and plane 0 has its own private use area, 0xE000-0xF8FF, with 6,400 code points. Also in plane 0, the range 0xD800-0xDFFF, 2,048 code points in total, is known as the surrogate area. Storing every code point directly is inefficient, so there are several Unicode encoding forms, the best known being UTF-8, UTF-16, and UTF-32. The following mainly introduces UTF-8 and UTF-16.
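The plane arithmetic described above is simple bit shifting, and the stated sizes follow from the range boundaries. A small sketch, assuming a hypothetical class `UnicodePlanes` (only `Character.isValidCodePoint` and `Character.MAX_CODE_POINT` are real JDK members):

```java
// Hypothetical helper class for the plane and range arithmetic.
class UnicodePlanes {
    // The plane is whatever remains after dropping the low 16 bits of the code point.
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    // Surrogate area in plane 0: 0xD800-0xDFFF.
    static boolean isSurrogateCodePoint(int codePoint) {
        return codePoint >= 0xD800 && codePoint <= 0xDFFF;
    }

    // Private use area in plane 0: 0xE000-0xF8FF.
    static boolean isBmpPrivateUse(int codePoint) {
        return codePoint >= 0xE000 && codePoint <= 0xF8FF;
    }

    public static void main(String[] args) {
        System.out.println("total code points: " + (Character.MAX_CODE_POINT + 1));
        System.out.println("plane of U+1F600:  " + plane(0x1F600));
        System.out.println("private use size:  " + (0xF8FF - 0xE000 + 1));
        System.out.println("surrogate size:    " + (0xDFFF - 0xD800 + 1));
    }
}
```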

UTF-8 encodes Unicode using the byte as its code unit, and it uses a different number of bytes for different ranges of code points:

Unicode code point (hex)    UTF-8 byte sequence (binary)
000000 ~ 00007F             0xxxxxxx
000080 ~ 0007FF             110xxxxx 10xxxxxx
000800 ~ 00FFFF             1110xxxx 10xxxxxx 10xxxxxx
010000 ~ 10FFFF             11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

This scheme ensures that each range uses a distinct byte pattern, so that when parsing, the prefix of the leading byte tells the decoder how many bytes make up the current character, and every continuation byte is recognizable by its 10 prefix.
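The prefix scheme can be written out by hand. The sketch below uses the standard UTF-8 range boundaries (0x7F, 0x7FF, 0xFFFF, 0x10FFFF) and is meant only as an illustration; in real code `String.getBytes(StandardCharsets.UTF_8)` does this work. The class name `Utf8Sketch` and its methods are hypothetical.

```java
// Hypothetical helper class: a hand-written UTF-8 encoder for one code point.
class Utf8Sketch {
    // Encodes one code point; surrogate code points are rejected because
    // they are not valid on their own.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("cannot encode code point " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    // Reads the prefix of a leading byte and returns the sequence length,
    // or -1 for a continuation byte or an invalid leading byte.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if ((b & 0x80) == 0x00) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return -1;
    }

    public static void main(String[] args) {
        for (byte b : encode(0x4E2D)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```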

UTF-16 uses a 16-bit unsigned integer as its code unit. U+0000 to U+FFFF contains the commonly used characters, and UTF-16 encodes a code point in this range directly as a single 16-bit code unit. A code point from U+10000 to U+10FFFF is encoded as two 16-bit code units (a surrogate pair). The encoding method is as follows:

    /**
     * UTF-16 encoding algorithm, big-endian.
     * @param offset the code point value
     * @return the encoded units as hexadecimal text, separated by a space
     */
    public String convertUTF16(int offset) {
        if (offset < 0x10000) {
            int vh = (offset & (0xFF << 8)) >> 8; // high 8 bits
            int vl = (offset & 0xFF);             // low 8 bits
            return Integer.toHexString(vh) + " " + Integer.toHexString(vl);
        }
        int val = offset - 0x10000;
        int vh = (val & (0x3FF << 10)) >> 10; // high 10 bits
        int vl = (val & 0x3FF);               // low 10 bits
        int ph = 0xD800; // base of the high surrogates
        int pl = 0xDC00; // base of the low surrogates
        String firstSymbol = Integer.toHexString(ph + vh);
        String secSymbol = Integer.toHexString(pl + vl);
        return firstSymbol + " " + secSymbol;
    }
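As a cross-check on the surrogate arithmetic above, the JDK can produce the same code units through `Character.toChars`, and the pair can be turned back into a code point by reversing the calculation. This is a sketch; the class name `SurrogateCheck` and its methods are illustrative, and the output format (one hex value per code unit) is a choice made here.

```java
// Hypothetical helper class for checking surrogate pairs against the JDK.
class SurrogateCheck {
    // Formats the UTF-16 code units of one code point as lowercase hex, space separated.
    static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(unit));
        }
        return sb.toString();
    }

    // Reverses the encoding: each surrogate carries 10 bits of (codePoint - 0x10000).
    static int decodePair(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex(0x1F600));
        System.out.println(Integer.toHexString(decodePair('\uD83D', '\uDE00')));
    }
}
```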

For encodings whose code units span more than one byte there is also the question of byte order (big-endian or little-endian). It is distinguished by the BOM (byte order mark), which is sent at the very start of the stream, before the data. The BOM for each Unicode encoding format is listed below:

UTF encoding    BOM
UTF-8           EF BB BF
UTF-16 LE       FF FE
UTF-16 BE       FE FF
UTF-32 LE       FF FE 00 00
UTF-32 BE       00 00 FE FF

Note: because its code unit is a single byte, UTF-8 has no byte-order problem and does not need a BOM to indicate byte order; when EF BB BF is present, it only marks the stream as UTF-8.
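A reader can use the BOM values listed above to guess the encoding from the first bytes of a stream. The following is a minimal sketch under the assumption that the data actually starts with a BOM; the class `BomSniffer` and its `detect` method are hypothetical names.

```java
// Hypothetical helper class: guesses an encoding from a leading BOM.
class BomSniffer {
    // Returns the encoding name implied by the BOM, or null if no BOM is present.
    static String detect(byte[] data) {
        // UTF-32 LE is tested before UTF-16 LE because both begin with FF FE.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "a".getBytes(java.nio.charset.StandardCharsets.UTF_16);
        System.out.println(detect(sample));
    }
}
```

One known ambiguity remains: UTF-16 LE text whose first character is U+0000 also begins with FF FE 00 00, so BOM sniffing is a heuristic rather than a proof.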

How Java handles character encoding is explained below through several questions:

1. What is the byte order of Java, Big-endian or Little-endian?

Java is big-endian. This can be verified with a program that encodes a string as UTF-16 and inspects the byte that follows the BOM:

    /**
     * Check byte order.
     * @throws UnsupportedEncodingException
     */
    public String checkByteOrder() throws UnsupportedEncodingException {
        String a = "a";
        byte[] arr = a.getBytes("utf-16");
        if (arr[2] == 0) { // the first two bytes are the BOM
            return "Big-endian";
        } else if (arr[2] == 97) {
            return "Little-endian";
        } else {
            return "Error";
        }
    }
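To see the bytes behind this check, the sketch below prints the encoded form of "a" under the three UTF-16 charsets in `StandardCharsets`. The class name `Utf16Variants` and the `hex` helper are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class: shows the byte layout of the UTF-16 charsets.
class Utf16Variants {
    // Formats bytes as uppercase two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF_16 writes a BOM and then big-endian data; the BE and LE variants write no BOM.
        System.out.println("UTF-16:   " + hex("a".getBytes(StandardCharsets.UTF_16)));
        System.out.println("UTF-16BE: " + hex("a".getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex("a".getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

Note that this shows the byte order chosen by Java's UTF-16 encoder, which is independent of the byte order of the underlying hardware.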

2. How are characters encoded in Java?

On the Java platform, char[], String, StringBuilder, and StringBuffer all use UTF-16: a BMP character is represented by one char, and a supplementary character is represented by a pair of chars.
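One practical consequence is that `String.length()` counts chars (UTF-16 code units), not characters. A short sketch, using U+1F600 as an arbitrary supplementary character; the class name `CodeUnitsVsCodePoints` and the `sample` method are hypothetical.

```java
// Hypothetical helper class: contrasts code units with code points.
class CodeUnitsVsCodePoints {
    // Builds "a" followed by one supplementary character (U+1F600).
    static String sample() {
        return "a" + new String(Character.toChars(0x1F600));
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        // Iterating by code point keeps the surrogate pair together.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```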

3. Which encoding does the Java String.getBytes() method use to produce bytes?

As the API documentation shows, String.getBytes() with no arguments uses the default charset, which depends on the platform. The default charset can be obtained as follows:

    Charset.defaultCharset()
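Because the default depends on the platform, passing the charset explicitly makes the result the same everywhere. A minimal sketch, with UTF-8 chosen here only as an example; the class name `ExplicitCharset` and its methods are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class: encodes and decodes with a fixed charset.
class ExplicitCharset {
    // Encodes with a fixed charset instead of the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes with the same fixed charset so the round trip is lossless.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("platform default: " + Charset.defaultCharset());
        String text = "\u00e9\u4e2d"; // one Latin and one CJK character, written as escapes
        byte[] bytes = encode(text);
        System.out.println(bytes.length + " bytes, round trip ok: " + decode(bytes).equals(text));
    }
}
```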

Finally, the StringBufferInputStream class is not recommended. It is an early Java class, and its read method returns only the low 8 bits of each char, which can cause many problems:

    int read() {
        return (pos < count) ? (buffer.charAt(pos++) & 0xFF) : -1;
    }
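The effect of dropping the high 8 bits can be shown with a single non-ASCII character. The sketch below compares the deprecated class with one possible alternative, a ByteArrayInputStream over explicitly encoded bytes; the class name `LossyStreamDemo` and its methods are hypothetical, and UTF-8 is an assumed choice of encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.StringBufferInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class: shows the data loss and one way around it.
class LossyStreamDemo {
    // Reads the first value from the deprecated stream: only the low 8 bits of the char.
    @SuppressWarnings("deprecation")
    static int firstByteLossy(String text) {
        StringBufferInputStream in = new StringBufferInputStream(text);
        return in.read();
    }

    // Encodes explicitly, then streams the real bytes.
    static byte[] readUtf8(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayInputStream in = new ByteArrayInputStream(encoded);
        byte[] out = new byte[encoded.length];
        in.read(out, 0, out.length);
        return out;
    }

    public static void main(String[] args) {
        String text = "\u4e2d"; // U+4E2D: the high byte 0x4E is lost below
        System.out.printf("lossy read: %02X%n", firstByteLossy(text));
        String restored = new String(readUtf8(text), StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + restored.equals(text));
    }
}
```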
