The original goal of Unicode was to use a 16-bit encoding to provide a mapping for more than 65,000 characters. This turned out not to be enough: it cannot cover all historical scripts, and it does not solve the implementation headaches, especially in network-based applications, because existing software has to do a lot of work to handle 16-bit data. Unicode therefore defines three encoding forms, together with some reserved code points: UTF-8, UTF-16, and UTF-32. As the name suggests, UTF-8 encodes each character as a sequence of 8-bit units, so a character occupies one or several bytes. The biggest benefit of this approach is that UTF-8 retains the ASCII encoding as a subset; for example, "A" is encoded as 0x41 in both UTF-8 and ASCII.
UTF-16 and UTF-32 are the Unicode encoding forms built on 16-bit and 32-bit units, respectively. Because of the original 16-bit goal, the word "Unicode" is often used loosely to mean UTF-16. When discussing Unicode, it is therefore very important to state which encoding is in use. For a technical introduction to Unicode, see http://www.unicode.org/unicode/standard/principles.html.
UTF-8/UTF-16/UTF-32
UTF, the Unicode Transformation Format, is the actual representation of Unicode code points. It is divided into UTF-8, UTF-16, and UTF-32 according to the number of bits in its basic code unit. It can also be regarded as a special external data encoding, but one that maps one-to-one to Unicode code points.
UTF-8 is a variable-length encoding: each Unicode code point takes 1 to 4 bytes, depending on the range it falls in (1 to 3 bytes for code points up to U+FFFF).
// UTF-8 is the compact Unicode encoding: ASCII text needs only one byte per character.
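The range-to-length rule can be sketched in a few lines of Python. This is only an illustration: the helper name utf8_length is made up for this example, and the result is cross-checked against Python's built-in "utf-8" codec.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point (hypothetical helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:      # ASCII range: 1 byte, identical to ASCII
        return 1
    if code_point <= 0x7FF:     # e.g. Latin letters with accents: 2 bytes
        return 2
    if code_point <= 0xFFFF:    # rest of the 16-bit range, e.g. most Chinese characters: 3 bytes
        return 3
    return 4                    # code points above U+FFFF: 4 bytes


# Compare the rule with what the built-in codec actually produces.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")
```

The first character in the loop also shows the ASCII compatibility mentioned earlier: "A" is the single byte 0x41 in UTF-8.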
UTF-16 has a relatively fixed length. Every code point up to U+FFFF is represented by one 16-bit (2-byte) unit; code points beyond that range, U+10000 to U+10FFFF, are represented by two 16-bit units (a surrogate pair), 4 bytes in total. Depending on byte order, UTF-16 is divided into UTF-16BE (big-endian) and UTF-16LE (little-endian).
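As a rough sketch of how a code point above U+FFFF becomes two 16-bit units, the following Python computes the pair by hand and compares it with the built-in "utf-16-be" and "utf-16-le" codecs. The function name to_surrogate_pair is an assumption made for this example.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into two 16-bit units (hypothetical helper)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point fits in one 16-bit unit or is out of range")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # upper 10 bits go into the first unit
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits go into the second unit
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# The same two units, serialized in each byte order.
big_endian = high.to_bytes(2, "big") + low.to_bytes(2, "big")
little_endian = high.to_bytes(2, "little") + low.to_bytes(2, "little")
assert chr(0x1F600).encode("utf-16-be") == big_endian
assert chr(0x1F600).encode("utf-16-le") == little_endian

# A code point up to U+FFFF needs only one 2-byte unit.
assert len("\u4e2d".encode("utf-16-be")) == 2
```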
UTF-32 always has a fixed length: each Unicode code point is represented by 32 bits (4 bytes). Depending on byte order, it is divided into UTF-32BE and UTF-32LE.
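A short Python check makes the fixed width and the two byte orders concrete. It uses only the built-in "utf-32-be" and "utf-32-le" codecs and the struct module; the sample text is arbitrary.

```python
import struct

code_point = 0x4E2D  # an arbitrary sample character

big_endian = chr(code_point).encode("utf-32-be")
little_endian = chr(code_point).encode("utf-32-le")

# UTF-32 is simply the code point written as a 4-byte unsigned integer.
assert big_endian == struct.pack(">I", code_point)
assert little_endian == struct.pack("<I", code_point)
print(big_endian.hex(), little_endian.hex())

# Every code point costs exactly 4 bytes, whatever its range.
text = "A\u4e2d\U0001F600"
assert len(text.encode("utf-32-be")) == 4 * len(text)
```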
The UTF encodings have the following advantage. The number of bytes per character varies, just as it does in GB2312/GBK, but with GB2312/GBK you must scan from the beginning of the text to locate Chinese characters correctly. With UTF-8 and UTF-16, a relatively fixed algorithm tells you from the current position whether the current unit is the beginning of a code point or a continuation of one, which makes locating characters relatively simple. UTF-32 is the easiest of all in this respect: no searching is needed, because code point n always starts at byte offset 4n, but the encoded text is also considerably larger.
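For UTF-8, the "relatively fixed algorithm" comes down to one bit test: continuation bytes always have the form 10xxxxxx, so stepping backwards until a byte does not match that pattern finds the start of the code point. The sketch below assumes well-formed UTF-8 input, and the helper names are invented for this example.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which always look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def code_point_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the code point that contains data[index].

    Hypothetical helper; assumes data is valid UTF-8.
    """
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


# Layout: "a" at byte 0, one 3-byte character at bytes 1-3,
# another at bytes 4-6, and "b" at byte 7.
data = "a\u4e2d\u6587b".encode("utf-8")

# Landing in the middle of the second Chinese character still finds its start.
start = code_point_start(data, 5)
print(start, data[start:start + 3].decode("utf-8"))
```

With UTF-32 none of this is needed, since the start of code point n is simply byte 4 * n.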