Differences between Unicode and UTF-8/UTF-16/UTF-32

Source: Internet
Author: User
The original objective of Unicode was to use a 16-bit encoding to provide code points for more than 65,000 characters. That turned out not to be enough: 16 bits cannot cover all historical texts, and a pure 16-bit design did not solve the implementation headaches, especially in network-based applications, because existing software needed a lot of rework to handle 16-bit data. Unicode therefore defines three encoding methods, together with some reserved code points: UTF-8, UTF-16, and UTF-32. As the name suggests, UTF-8 works in 8-bit units and represents each character as a sequence of one or more bytes. The biggest benefit of this approach is that UTF-8 retains ASCII as a subset: for example, in both UTF-8 and ASCII, "A" is encoded as 0x41 and "a" as 0x61.
UTF-16 and UTF-32 are the Unicode encoding methods built on 16-bit and 32-bit units, respectively. Given the original 16-bit goal, "Unicode" is often used loosely to mean UTF-16, so when discussing Unicode it is important to establish which encoding method is actually in use. For a technical introduction to Unicode, see http://www.unicode.org/unicode/standard/principles.html.
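
To make the ASCII compatibility point concrete, here is a minimal sketch that encodes the letter "A" with each encoding method, using Python's built-in codecs. The big-endian variants are chosen only to keep the output free of a byte order mark; the variable names are illustrative.

```python
# Encode one ASCII letter with each Unicode encoding method and compare the bytes.
text = "A"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte order mark
utf32_bytes = text.encode("utf-32-be")  # big-endian, no byte order mark

print(ascii_bytes.hex())  # 41
print(utf8_bytes.hex())   # 41 (identical to ASCII)
print(utf16_bytes.hex())  # 0041
print(utf32_bytes.hex())  # 00000041
```

Only the UTF-8 bytes match the ASCII bytes; software that expects ASCII would see extra zero bytes in the UTF-16 and UTF-32 output.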

UTF-8/UTF-16/UTF-32

UTF, the Unicode Transformation Format, is the actual representation of Unicode code points. It is divided into UTF-8, UTF-16, and UTF-32 according to the size in bits of its basic unit. A UTF can also be regarded as a special external data encoding, but one that maps one-to-one to Unicode code points.

UTF-8 is a variable-length encoding: each Unicode code point takes 1 to 4 bytes, depending on the range it falls in.
// UTF-8 can be thought of as the compact Unicode encoding: ASCII text takes only one byte per character.
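
As an illustration of the variable length, the sketch below measures how many bytes UTF-8 needs for one sample code point from each length range (one to four bytes). The sample characters and the helper name utf8_length are chosen for this example only.

```python
# One sample code point from each UTF-8 length range.
samples = [
    "\u0041",      # U+0041 LATIN CAPITAL LETTER A  (range U+0000..U+007F)
    "\u00e9",      # U+00E9 e with acute accent     (range U+0080..U+07FF)
    "\u4e2d",      # U+4E2D a CJK ideograph         (range U+0800..U+FFFF)
    "\U0001f600",  # U+1F600 an emoji               (range U+10000..U+10FFFF)
]

def utf8_length(ch):
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

lengths = [utf8_length(ch) for ch in samples]
for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+4E2D -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```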

The length of UTF-16 is relatively fixed: as long as no characters above \uFFFF are processed, each Unicode code point is represented as one 16-bit (2-byte) unit. Code points beyond that range, from U+10000 to U+10FFFF, are represented as two 16-bit units (a surrogate pair, 4 bytes). According to byte order, UTF-16 is divided into UTF-16BE (big-endian) and UTF-16LE (little-endian).
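
The two-unit case can be worked through by hand. The following sketch computes the pair of 16-bit units (a surrogate pair) for a code point above U+FFFF and checks it against Python's built-in UTF-16 codecs; the function name to_surrogate_pair is hypothetical, not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two 16-bit UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point fits in one 16-bit unit or is out of range")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the first unit
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the second unit
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# The same character through the built-in codecs, in both byte orders.
encoded_be = chr(0x1F600).encode("utf-16-be")
encoded_le = chr(0x1F600).encode("utf-16-le")
print(encoded_be.hex())  # d83dde00
print(encoded_le.hex())  # 3dd800de
```

The two byte orders hold the same pair of units; only the order of the two bytes inside each unit differs.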

The length of UTF-32 is always fixed: each Unicode code point is represented as one 32-bit (4-byte) unit. According to byte order, UTF-32 is divided into UTF-32BE (big-endian) and UTF-32LE (little-endian).
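
Because every code point occupies exactly four bytes, the n-th code point of a UTF-32 string can be read directly at byte offset 4n. A minimal sketch, assuming big-endian data without a byte order mark; code_point_at is a hypothetical helper written for this example.

```python
text = "A\u4e2d\U0001f600"  # three code points: U+0041, U+4E2D, U+1F600

data_be = text.encode("utf-32-be")
data_le = text.encode("utf-32-le")

def code_point_at(data, index):
    """Read the code point at a given index from big-endian UTF-32 bytes."""
    start = 4 * index
    return int.from_bytes(data[start:start + 4], "big")

print(len(data_be))                       # 12, i.e. 4 bytes per code point
print(hex(code_point_at(data_be, 2)))     # 0x1f600
print(data_be[:4].hex(), data_le[:4].hex())  # 00000041 41000000
```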

UTF encodings have the following advantage. Like GB2312/GBK, UTF-8 and UTF-16 use a varying number of bytes per character, but with GB2312/GBK you must scan from the start of the text to locate Chinese characters correctly. In a UTF encoding, a fixed rule lets you tell from the current position whether the current byte starts a code point or continues one, which makes locating characters relatively simple. UTF-32 is the easiest of all: it needs no character locating at all, because every code point sits at a multiple of 4 bytes, but the size of the text also increases considerably.
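
For UTF-8, the fixed rule is that continuation bytes always have the bit pattern 10xxxxxx, while a byte that starts a code point never does. A minimal sketch of character location based on that rule, assuming the input is valid UTF-8; the function name char_start is hypothetical.

```python
def char_start(data, pos):
    """Return the offset where the code point containing byte `pos` begins.

    Assumes `data` is valid UTF-8 and 0 <= pos < len(data).
    """
    # Continuation bytes look like 10xxxxxx; step back until we leave them.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

data = "a\u4e2db".encode("utf-8")  # bytes: 61 | e4 b8 ad | 62

print(char_start(data, 3))  # 1, byte 3 is inside the 3-byte character
print(char_start(data, 4))  # 4, byte 4 is the single-byte "b"
```

The search never has to look back more than three bytes, since a UTF-8 sequence is at most four bytes long.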
