Unicode code points and encoding methods

Source: Internet
Author: User

I. The Unicode character set

UTF stands for Unicode Transformation Format. It is the concrete encoding form of the UCS (Universal Multiple-Octet Coded Character Set, the universal character set defined by the international standard ISO/IEC 10646). The formats are named after the bit width of their basic unit, which gives three forms: UTF-8, UTF-16, and UTF-32. Unicode can be regarded as a superset of other character sets, so it works as an interchange format between them: text converted from another encoding into the UCS and then back to its original encoding loses no information. The UCS covers the characters of the world's known writing systems, from Latin and Greek to Chinese ideographs, Korean hangul, Japanese hiragana and katakana, and many others. For this reason, UTF encodings are the natural first choice when internationalizing a program: Unicode brings the world's scripts together in a single character set.

II. Code points and code units

Code point and code unit are terms from the Unicode standard, whose core is a coded character set.

Code point: the numeric value assigned to a character in the Unicode code table, written as U+ followed by hexadecimal digits (for example, U+0041 for "A").

Code unit: the basic fixed-width unit of an encoding form. In Java, a char is one 16-bit UTF-16 code unit.
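The difference between the two terms shows up as soon as a string contains a character outside the 16-bit range. The following is a minimal sketch; the class and method names are made up for illustration, and the sample string uses U+10400, a supplementary character of the kind described in the following sections.

```java
class CodePointVsCodeUnit {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnitCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A" followed by U+10400, written as two UTF-16 code units.
        String s = "A\uD801\uDC00";
        System.out.println(codeUnitCount(s));  // 3
        System.out.println(codePointCount(s)); // 2
    }
}
```

The string holds two characters but three chars, which is why String.length() alone does not count characters.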

III. The encoding space

The Unicode encoding space runs from U+0000 to U+10FFFF. Excluding the 2,048 code points reserved for surrogates, 1,112,064 code points are available for characters; Unicode 4.0 assigns characters to 96,382 of them.

The encoding space is divided into 17 planes, each containing 2^16 (65,536) code points. The code points of a plane run from U+xx0000 to U+xxFFFF, where xx is a hexadecimal value from 0x00 to 0x10, for a total of 17 planes.

The code points from U+0000 to U+FFFF form the Basic Multilingual Plane (BMP). This plane corresponds to the original 16-bit design: early on, the required capacity was underestimated, and 2^16 code points were thought to be enough for Unicode.

The other 16 planes are supplementary planes. Characters with code points in the range U+10000 to U+10FFFF are called supplementary characters; they cannot be represented in the original 16-bit Unicode design.
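Since each plane holds 2^16 code points, the plane number of a code point is simply the bits above its low 16. The sketch below illustrates this; PlaneInfo and planeOf are hypothetical names, while the Character methods it calls are part of the standard library (isBmpCodePoint requires Java 7 or later).

```java
class PlaneInfo {
    // Plane index (0 to 16) of a code point: the bits above the low 16.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x6771, 0x10400, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X plane=%d bmp=%b supplementary=%b%n",
                    cp, planeOf(cp),
                    Character.isBmpCodePoint(cp),
                    Character.isSupplementaryCodePoint(cp));
        }
    }
}
```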

IV. Representation of Unicode in Java

Java uses char to represent Unicode characters. Because Unicode originally needed at most 16 bits, a char could represent every Unicode character. By Unicode 4.0, however, the standard defined far more than 65,536 characters, so a single char can no longer represent them all. A char can only hold values from U+0000 to U+FFFF; in other words, a single char cannot represent a supplementary character.

In Java, Unicode code points are represented as int values. The 21 least significant bits of the int hold the code point, and the 11 most significant bits must be zero. An int can therefore represent the supplementary characters that a char cannot.
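As an illustration of working with int code points, the sketch below builds a string from code points and then walks it one code point at a time. The class name and the fromCodePoints helper are assumptions made for this example; appendCodePoint, codePointAt, and Character.charCount are standard library methods.

```java
class CodePointsInJava {
    // Build a String from code points held in ints.
    static String fromCodePoints(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.appendCodePoint(cp); // appends one or two chars as needed
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = fromCodePoints(0x41, 0x10400);
        // Advance by code point rather than by char.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("index %d: U+%04X uses %d char(s)%n",
                    i, cp, Character.charCount(cp));
            i += Character.charCount(cp);
        }
    }
}
```

Stepping by Character.charCount(cp) instead of by 1 keeps the loop from landing in the middle of a supplementary character.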

V. UTF-8, UTF-16, UTF-32

A UTF is the concrete representation of Unicode code points as data. The formats are named UTF-8, UTF-16, and UTF-32 after the bit width of their basic unit. Each can be thought of as an external data encoding that maps one-to-one to Unicode code points.

UTF-8 is a variable-length encoding: each code point takes 1 to 4 bytes depending on its range. It is the compact form of Unicode, especially for text that is mostly ASCII.

UTF-16 has a relatively fixed length: as long as no characters above U+FFFF are involved, each code point is represented by one 16-bit (2-byte) unit. Code points above that range use two 16-bit units, or 4 bytes. Depending on byte order, it is divided into UTF-16BE (big-endian) and UTF-16LE (little-endian).

UTF-32 always has a fixed length: each code point is represented by one 32-bit (4-byte) unit. Depending on byte order, it is divided into UTF-32BE and UTF-32LE.
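The effect of byte order can be seen by encoding the same character both ways. This is a small sketch with hypothetical helper names; it uses only the charsets that StandardCharsets guarantees, so UTF-32 is left out.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf16be(String s) { return hex(s.getBytes(StandardCharsets.UTF_16BE)); }
    static String utf16le(String s) { return hex(s.getBytes(StandardCharsets.UTF_16LE)); }
    static String utf8(String s)    { return hex(s.getBytes(StandardCharsets.UTF_8)); }

    public static void main(String[] args) {
        String s = "\u6771"; // the single code point U+6771
        System.out.println(utf16be(s)); // 67 71
        System.out.println(utf16le(s)); // 71 67
        System.out.println(utf8(s));    // E6 9D B1
    }
}
```

UTF-8 has no BE/LE variants because its unit is a single byte.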

UTF encodings have an advantage over encodings such as GB2312/GBK: although a character may span several bytes, there is no need to scan from the start of the text in order to locate a character correctly. Following a fixed rule, software can tell from the current position alone whether the current unit starts a code point or continues one, which makes locating characters relatively simple. UTF-32 is the simplest case, since every code point occupies exactly one unit and no boundary detection is needed, but the text also takes up considerably more space.
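For UTF-8, the fixed rule is that every continuation byte has the bit pattern 10xxxxxx, so a program can step backwards from any byte to the start of its code point. The sketch below assumes well-formed UTF-8 input; Utf8Boundary and its methods are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

class Utf8Boundary {
    // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Step back from an arbitrary index to the first byte of the
    // code point that contains it. Assumes well-formed UTF-8.
    static int startOfCodePoint(byte[] utf8, int index) {
        int i = index;
        while (i > 0 && isContinuation(utf8[i])) {
            i--;
        }
        return i;
    }

    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("A\u6771"); // bytes: 41 E6 9D B1
        System.out.println(startOfCodePoint(bytes, 3)); // 1
        System.out.println(startOfCodePoint(bytes, 0)); // 0
    }
}
```

At most three steps back are ever needed, because a UTF-8 sequence is at most four bytes long.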

UTF-32 represents each Unicode code point as a 32-bit integer of the same value. It is clearly the most convenient form for internal processing, but it consumes considerably more memory when used as a general string representation.

UTF-16 encodes Unicode code points as a sequence of one or two unsigned 16-bit code units. The values U+0000 to U+FFFF are encoded as a single 16-bit unit of the same value. A supplementary character is encoded as two code units: the first comes from the high-surrogate range (U+D800 to U+DBFF) and the second from the low-surrogate range (U+DC00 to U+DFFF). This may seem conceptually similar to a multibyte encoding, but there is one important difference: the values U+D800 to U+DFFF are reserved for UTF-16, and no character is ever assigned to them as a code point. This means that for each individual code unit in a string, software can tell whether it represents a single-unit character, or whether it is the first or the second unit of a two-unit character. This is a significant improvement over some traditional multibyte character encodings, in which the byte value 0x41 may represent either the letter "A" or the second byte of a double-byte character.
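The mapping between a supplementary code point and its surrogate pair is plain arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The following is a minimal sketch with hypothetical method names; the JDK offers equivalent helpers such as Character.toChars and Character.toCodePoint, which the example uses as a cross-check.

```java
class SurrogateMath {
    // High (first) surrogate: 0xD800 plus the top 10 of the 20 offset bits.
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (second) surrogate: 0xDC00 plus the bottom 10 offset bits.
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    // Inverse operation: rebuild the code point from a surrogate pair.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x10400;
        char hi = highSurrogate(cp);
        char lo = lowSurrogate(cp);
        System.out.printf("%04X %04X%n", (int) hi, (int) lo); // D801 DC00
        System.out.printf("U+%X%n", combine(hi, lo));          // U+10400
        // Cross-check against the JDK's own conversion.
        System.out.println(Character.toCodePoint(hi, lo) == cp); // true
    }
}
```

These helpers are only meaningful for code points from U+10000 to U+10FFFF; BMP code points are stored as a single unit and need no conversion.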

UTF-8 encodes a Unicode code point as a sequence of one to four bytes. U+0000 to U+007F use one byte, U+0080 to U+07FF use two bytes, U+0800 to U+FFFF use three bytes, and U+10000 to U+10FFFF use four bytes. The design principle of UTF-8 is that the byte values 0x00 to 0x7F always represent the code points U+0000 to U+007F (the Basic Latin subset, which corresponds to the ASCII character set). These byte values never appear as part of the encoding of any other code point, which makes UTF-8 easy to use with software that assigns special meaning to certain ASCII characters.
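The four ranges translate directly into a small encoder. The sketch below is for illustration only (in practice String.getBytes with StandardCharsets.UTF_8 does this work); Utf8Encoder and its methods are made-up names, and the bit layouts are the standard UTF-8 patterns 0xxxxxxx, 110xxxxx, 1110xxxx, and 11110xxx for lead bytes with 10xxxxxx for continuation bytes.

```java
class Utf8Encoder {
    // Encode one code point as UTF-8, choosing the length by range.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not encodable: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    // Format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x0041)));  // 41
        System.out.println(hex(encode(0x00DF)));  // C3 9F
        System.out.println(hex(encode(0x6771)));  // E6 9D B1
        System.out.println(hex(encode(0x10400))); // F0 90 90 80
    }
}
```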

The following table compares the representations of several characters:

Unicode code point    U+0041      U+00DF      U+6771      U+10400
Glyph                 A           ß           東          𐐀
UTF-32 code units     00000041    000000DF    00006771    00010400
UTF-16 code units     0041        00DF        6771        D801 DC00
UTF-8 code units      41          C3 9F       E6 9D B1    F0 90 90 80
