First, the Unicode character set
UTF stands for Unicode Transformation Format. It is the concrete encoding form of UCS, the Universal Multiple-Octet Coded Character Set defined by international standard ISO/IEC 10646. The forms are named after the number of bits in their basic unit: UTF-8, UTF-16 and UTF-32. Unicode can be regarded as a superset of other character sets, which makes those character sets interoperable: text converted from its original encoding into UCS and back again loses no information. UCS covers the characters of the world's known writing systems, from Latin and Greek to Chinese ideographs, Korean Hangul, Japanese hiragana and katakana, and many more. A UTF encoding is therefore the natural first choice for internationalized software, because Unicode unifies the world's scripts in a single character set.
Second, code points and code units
Code point and code unit are terms from the Unicode standard, whose core is a coded character set.
Code point: the numeric value assigned to a character in the Unicode code table.
Code unit: the basic unit of a particular encoding form. In Java, a char is one 16-bit UTF-16 code unit.
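The difference between the two terms can be seen by counting the same string both ways. The following is a minimal sketch; the class name CodePointVsCodeUnit and its helper methods are hypothetical names chosen for illustration, while String.length() and String.codePointCount() are standard Java methods.

```java
// Sketch: one code point may occupy one or two Java char values (UTF-16 code units).
class CodePointVsCodeUnit {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String low = "A";              // U+0041, fits in one char
        String high = "\uD801\uDC00";  // U+10400, needs two chars
        System.out.println(codeUnits(low) + " code unit(s), " + codePoints(low) + " code point(s)");
        System.out.println(codeUnits(high) + " code unit(s), " + codePoints(high) + " code point(s)");
    }
}
```

For the second string the two counts differ: two code units, but a single code point.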
Third, the encoding method
The Unicode code space runs from U+0000 to U+10FFFF. Excluding the range reserved for surrogates, that leaves 1,112,064 code points available for characters, and Unicode 4.0 assigns characters to 96,382 of them.
The code space is divided into 17 planes, each containing 2^16 (65,536) code points. The code points of a plane run from U+xx0000 to U+xxFFFF, where xx is a hexadecimal value from 0x00 to 0x10, giving 17 planes in total.
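Because every plane holds exactly 2^16 code points, the plane number is simply the part of the code point above its low 16 bits. A small sketch of that arithmetic follows; the class UnicodePlanes and its methods are hypothetical helpers, not part of the Java API.

```java
// Sketch: plane arithmetic on int code points.
class UnicodePlanes {
    static final int PLANE_COUNT = 17;
    static final int PLANE_SIZE = 1 << 16; // 65,536 code points per plane

    // Plane index (0 to 16) of a valid code point: the bits above the low 16.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    // First and last code point of a plane.
    static int firstCodePoint(int plane) { return plane << 16; }
    static int lastCodePoint(int plane)  { return (plane << 16) | 0xFFFF; }

    public static void main(String[] args) {
        System.out.println(planeOf(0x6771));   // 0
        System.out.println(planeOf(0x10400));  // 1
        System.out.printf("plane 16: U+%06X to U+%06X%n", firstCodePoint(16), lastCodePoint(16));
        System.out.println(PLANE_COUNT * PLANE_SIZE); // size of the whole code space
    }
}
```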
The characters from U+0000 to U+FFFF make up the Basic Multilingual Plane (BMP). This is the range of the original 16-bit design: early on, the required capacity was underestimated, and it was assumed that Unicode would need only 2^16 code points.
The other 16 planes are supplementary planes. Characters with code points in the range U+10000 to U+10FFFF are called supplementary characters; they cannot be represented in the original 16-bit Unicode design.
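Java can classify a code point as BMP or supplementary directly. The sketch below wraps the standard methods Character.isValidCodePoint and Character.isSupplementaryCodePoint in a hypothetical helper named describe.

```java
// Sketch: classifying code points as BMP or supplementary.
class SupplementaryCheck {
    static String describe(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid";            // outside U+0000 to U+10FFFF
        }
        return Character.isSupplementaryCodePoint(codePoint) ? "supplementary" : "BMP";
    }

    public static void main(String[] args) {
        int[] samples = { 0x0041, 0xFFFF, 0x10000, 0x10FFFF, 0x110000 };
        for (int cp : samples) {
            System.out.printf("U+%04X: %s%n", cp, describe(cp));
        }
        // Character.charCount reports how many char values a code point needs.
        System.out.println(Character.charCount(0x10400)); // 2
    }
}
```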
Fourth, representation of Unicode in Java
Java uses char to represent Unicode characters. When Java was designed, Unicode needed at most 16 bits, so a char could represent every Unicode character. By Unicode 4.0 the standard defined far more than 65,536 characters, so a single char can no longer represent all of them. It can only hold values from U+0000 to U+FFFF; in other words, a char cannot represent a supplementary character.
In Java, a Unicode code point is represented as an int. The 21 low-order (least significant) bits hold the code point, and the 11 high-order (most significant) bits must be zero. In other words, an int can represent the supplementary characters that a char cannot.
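As an illustration of working with int code points, the sketch below builds a string from code points and reads one back. The class IntCodePoints and its methods are hypothetical; StringBuilder.appendCodePoint and String.codePointAt are standard Java methods. Note that having the upper 11 bits clear is necessary but not sufficient: the value must also be no greater than 0x10FFFF, which Character.isValidCodePoint checks.

```java
// Sketch: int values as code points.
class IntCodePoints {
    // True when the 11 high-order bits are zero, i.e. the value fits in 21 bits.
    static boolean fitsIn21Bits(int value) {
        return (value >>> 21) == 0;
    }

    // Builds a String from int code points; appendCodePoint emits
    // one char for a BMP code point and two for a supplementary one.
    static String fromCodePoints(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = fromCodePoints(0x41, 0x10400);
        System.out.println(s.length());                          // 3 chars
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 10400
        System.out.println(fitsIn21Bits(0x1FFFFF) + " " + Character.isValidCodePoint(0x1FFFFF));
    }
}
```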
Fifth, UTF-8, UTF-16, UTF-32
A UTF is a concrete representation of Unicode code points. The three forms, UTF-8, UTF-16 and UTF-32, are named after the number of bits in their basic unit. Each can be thought of as an external data encoding that maps one-to-one to Unicode code points.
UTF-8 is a variable-length encoding: each Unicode code point takes 1 to 4 bytes depending on its range. It is the most compact form for text dominated by ASCII characters.
UTF-16 is close to fixed-length: as long as no characters above U+FFFF are involved, each code point is one 16-bit (2-byte) code unit. Code points above that range take two UTF-16 code units, or 4 bytes. Depending on byte order, it comes in two variants, UTF-16BE and UTF-16LE.
UTF-32 is always fixed-length: each Unicode code point takes 32 bits, or 4 bytes. Depending on byte order, it comes in two variants, UTF-32BE and UTF-32LE.
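The effect of byte order can be made visible by dumping the encoded bytes. In this sketch the UTF-16 bytes come from the standard StandardCharsets.UTF_16BE and UTF_16LE charsets, while the UTF-32 bytes are written out by hand to avoid depending on an optional charset. The class ByteOrderDemo and all of its method names are hypothetical.

```java
// Sketch: big-endian versus little-endian byte order for UTF-16 and UTF-32.
class ByteOrderDemo {
    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf16be(String s) {
        return hex(s.getBytes(java.nio.charset.StandardCharsets.UTF_16BE));
    }

    static String utf16le(String s) {
        return hex(s.getBytes(java.nio.charset.StandardCharsets.UTF_16LE));
    }

    // UTF-32BE by hand: the most significant byte comes first.
    static String utf32be(int cp) {
        return hex(new byte[] { (byte) (cp >>> 24), (byte) (cp >>> 16), (byte) (cp >>> 8), (byte) cp });
    }

    // UTF-32LE by hand: the least significant byte comes first.
    static String utf32le(int cp) {
        return hex(new byte[] { (byte) cp, (byte) (cp >>> 8), (byte) (cp >>> 16), (byte) (cp >>> 24) });
    }

    public static void main(String[] args) {
        System.out.println(utf16be("\u6771")); // 67 71
        System.out.println(utf16le("\u6771")); // 71 67
        System.out.println(utf32be(0x10400));  // 00 01 04 00
        System.out.println(utf32le(0x10400));  // 00 04 01 00
    }
}
```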
The UTF encodings may use more bytes, but they have an advantage over encodings such as GB2312/GBK, where text must be scanned from the beginning in order to locate a Chinese character correctly. In a UTF encoding, a fixed rule tells you from the current position whether the current byte or code unit starts or continues a code point, which makes locating characters relatively simple. Locating is simplest in UTF-32, where no search is needed at all, but the encoded text is also considerably larger.
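The fixed rule mentioned above can be sketched as follows. In UTF-8, every continuation byte has the bit pattern 10xxxxxx, so stepping backwards over such bytes finds the start of the current code point. In UTF-16, a low surrogate preceded by a high surrogate is the second half of a pair. The class BoundaryChecks and its methods are hypothetical names; Character.isHighSurrogate and Character.isLowSurrogate are standard Java methods.

```java
// Sketch: finding the start of a code point from an arbitrary position.
class BoundaryChecks {
    // UTF-8 continuation bytes always look like 10xxxxxx.
    static boolean isUtf8Continuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Steps back from any byte index to the first byte of its code point.
    static int utf8Start(byte[] bytes, int index) {
        while (index > 0 && isUtf8Continuation(bytes[index])) {
            index--;
        }
        return index;
    }

    // Steps back from a low surrogate to its high surrogate, if there is one.
    static int utf16Start(CharSequence s, int index) {
        if (index > 0
                && Character.isLowSurrogate(s.charAt(index))
                && Character.isHighSurrogate(s.charAt(index - 1))) {
            return index - 1;
        }
        return index;
    }

    public static void main(String[] args) {
        byte[] utf8 = "A\u6771".getBytes(java.nio.charset.StandardCharsets.UTF_8); // 41 E6 9D B1
        System.out.println(utf8Start(utf8, 3));              // 1
        System.out.println(utf16Start("A\uD801\uDC00", 2));  // 1
    }
}
```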
UTF-32 represents each Unicode code point as a 32-bit integer with the same value. It is clearly the most convenient form for internal processing, but it consumes more memory when used as a general string representation.
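In Java, an int[] of code points is effectively a UTF-32 view of a string. The sketch below converts in both directions, assuming Java 8 or later for String.codePoints(); the class Utf32View and its methods are hypothetical names.

```java
// Sketch: an int[] of code points as a UTF-32 representation.
class Utf32View {
    // One int per code point (requires Java 8+ for codePoints()).
    static int[] toUtf32(String s) {
        return s.codePoints().toArray();
    }

    // Rebuilds the UTF-16 based String from the code point array.
    static String fromUtf32(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = "A\uD801\uDC00";           // U+0041 followed by U+10400
        int[] utf32 = toUtf32(s);
        System.out.println(utf32.length);      // 2 code points
        System.out.println(utf32.length * 4);  // 8 bytes as UTF-32
        System.out.println(s.length() * 2);    // 6 bytes as UTF-16
    }
}
```

Indexing the array gives direct access to the n-th code point, at the cost of four bytes for every character.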
UTF-16 encodes Unicode code points as a sequence of one or two unsigned 16-bit code units. The values U+0000 to U+FFFF are encoded as a single 16-bit unit with the same value. A supplementary character is encoded as two code units: the first comes from the high-surrogate range (U+D800 to U+DBFF) and the second from the low-surrogate range (U+DC00 to U+DFFF). This may seem conceptually similar to a multibyte encoding, but there is one important difference: the values U+D800 to U+DFFF are reserved for UTF-16, and no character is assigned to any of them. As a result, software can tell for each individual code unit in a string whether it represents a one-unit character or is the first or second unit of a two-unit character. This is a significant improvement over some traditional multibyte character encodings, in which the byte value 0x41 may be either the letter "A" or the second byte of a double-byte character.
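The surrogate pair for a supplementary character follows from simple arithmetic: subtract 0x10000 to get a 20-bit value, then put the upper 10 bits into the high surrogate and the lower 10 bits into the low surrogate. The sketch below spells this out; SurrogateMath and its methods are hypothetical names, and the standard Character.toChars and Character.toCodePoint methods perform the same conversions.

```java
// Sketch: converting between a supplementary code point and its surrogate pair.
class SurrogateMath {
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                  // 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));     // upper 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));    // lower 10 bits
        return new char[] { high, low };
    }

    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10400);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);          // D801 DC00
        System.out.printf("U+%X%n", fromSurrogates(pair[0], pair[1]));           // U+10400
    }
}
```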
UTF-8 encodes a Unicode code point as a sequence of one to four bytes. U+0000 to U+007F take one byte, U+0080 to U+07FF take two bytes, U+0800 to U+FFFF take three bytes, and U+10000 to U+10FFFF take four bytes. The design principle of UTF-8 is that the byte values 0x00 to 0x7F always represent the code points U+0000 to U+007F (the Basic Latin block, which corresponds to the ASCII character set). These byte values never appear as part of any other code point, which makes UTF-8 easy to use in software that assigns special meaning to certain ASCII characters.
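The four ranges translate directly into a small encoder. This is a hand-written sketch for illustration only; in practice String.getBytes(StandardCharsets.UTF_8) does the same job. The class Utf8Encoder and its encode method are hypothetical names.

```java
// Sketch: encoding a single code point as UTF-8 by hand.
class Utf8Encoder {
    static byte[] encode(int cp) {
        if (!Character.isValidCodePoint(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {                       // 1 byte: 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {                      // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {                     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {                     // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x41, 0xDF, 0x6771, 0x10400 }) {
            StringBuilder sb = new StringBuilder();
            for (byte b : encode(cp)) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            System.out.printf("U+%04X -> %s%n", cp, sb.toString().trim());
        }
    }
}
```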
The following table compares the different representations of several characters:
Unicode code point | U+0041 | U+00DF | U+6771 | U+10400
Representative glyph | A | ß | 東 | 𐐀
UTF-32 code units | 00000041 | 000000DF | 00006771 | 00010400
UTF-16 code units | 0041 | 00DF | 6771 | D801 DC00
UTF-8 code units | 41 | C3 9F | E6 9D B1 | F0 90 90 80
Unicode code points and encoding methods