Unicode UTF-8 UTF-16

Source: Internet
Author: User

UNICODE:
A character encoding standard developed by unicode.org, intended to cover the scripts in common use all over the world.
In version 1.0 it was a 16-bit code, from U+0000 to U+FFFF, with each 2-byte code corresponding to one character. Starting with 2.0, the 16-bit limit was abandoned. The original 16-bit range became the basic plane (plane 0), and 16 more planes were added, so the code range is now 0 to 0x10FFFF. The 16 added planes amount to a 20-bit space; the full range takes 21 bits to represent.
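
The plane layout described above can be checked with a little arithmetic. The following is a minimal sketch; the helper name plane_of is made up for this illustration and is not part of any standard library.

```python
# Sketch: split a code point into its plane number and its offset in that plane.
PLANE_SIZE = 0x10000        # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF   # last code point of plane 16

def plane_of(code_point):
    """Return (plane, offset) for a code point in 0..0x10FFFF (hypothetical helper)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the Unicode range: {code_point:#x}")
    return code_point >> 16, code_point & 0xFFFF

print(plane_of(ord("A")))                  # (0, 65): plane 0, the original 16-bit range
print(plane_of(0x1F600))                   # (1, 62976): a code point in an added plane
print((MAX_CODE_POINT + 1) // PLANE_SIZE)  # 17 planes in total
print(MAX_CODE_POINT.bit_length())         # 21 bits for the full range
```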

UCS:
The Universal Character Set, defined by ISO in ISO 10646, which uses a 4-byte encoding.

UNICODE and UCS:
ISO and unicode.org are two separate organizations, so they initially developed different standards. Since Unicode 2.0, however, Unicode has used the same character repertoire and code values as ISO 10646-1, and ISO has promised that ISO 10646 will not assign any UCS-4 code above 0x10FFFF, so the two stay consistent.

The encoding method of UCS:

  • UCS-2, which is essentially the same as the 2-byte encoding of Unicode.
  • UCS-4, a 4-byte encoding. For the characters covered by UCS-2, it is currently the UCS-2 code with two all-zero bytes added in front.
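
The relationship between the 2-byte and 4-byte forms can be seen directly for a character in the 16-bit range. This is only a sketch: Python has no codecs named UCS-2 or UCS-4, so the big-endian UTF-16 and UTF-32 codecs stand in for them here, which is an assumption that holds for characters in the basic plane.

```python
# Sketch: for a basic-plane character, the 4-byte form is the 2-byte form
# with two zero bytes in front. UTF-16/UTF-32 stand in for UCS-2/UCS-4.
ch = "\u6c49"                        # the character 汉, inside the 16-bit range
two_byte = ch.encode("utf-16-be")    # 6c 49
four_byte = ch.encode("utf-32-be")   # 00 00 6c 49

print(two_byte.hex(), four_byte.hex())      # 6c49 00006c49
print(four_byte == b"\x00\x00" + two_byte)  # True
```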

    UTF: Unicode/UCS Transformation Format

  • UTF-8, an encoding built on 8-bit units. ASCII characters are unchanged; other characters use variable-length encoding, so each character takes 1 to 4 bytes (1 to 3 bytes within the basic plane). Usually used as an external code. It has the following advantages:
    * It is independent of CPU byte order, so text can be exchanged between different platforms.
    * High fault tolerance. If any one byte is damaged, at most one character is lost, and the error does not cascade (in some other encodings, one incorrect byte garbles the rest of the line).
  • UTF-16, an encoding built on 16-bit units. It is variable length: characters in the basic plane take one unit and all others take two (a surrogate pair), covering values from 0 to 0x10FFFF. It is essentially the direct implementation of the Unicode code. It depends on CPU byte order, but because it is compact for text that lies mostly in the basic plane, it is often used as an external code as well. In many systems, "Unicode" informally refers to UTF-16.
  • UTF-32, a fixed 32-bit encoding restricted to the Unicode range (0 to 0x10FFFF), equivalent to a subset of UCS-4.

    UTF and UNICODE:
    Unicode is a character set and can be viewed as an internal code.
    UTF is an encoding method, needed because Unicode code values are not suitable for direct transmission and processing in some scenarios. For characters in the basic plane, UTF-16 stores the Unicode code directly, with no transformation, but the result contains 0x00 bytes: for the first 256 codes the high byte is 0x00, which has special meaning in the operating system and in C (it terminates a string) and may cause problems. Converting Unicode to UTF-8 avoids this problem and brings some other advantages.

    UTF-8 uses a single byte as its encoding unit, so it has no byte order issue. UTF-16 uses two bytes as its encoding unit, so before interpreting UTF-16 text you must know the byte order of each unit. For example, the Unicode code of "奎" (kui) is 594E and that of "乙" (yi) is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"? The method the Unicode specification recommends for marking byte order is the BOM (byte order mark).
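
The byte order ambiguity and the 0x00 problem can both be reproduced in a few lines. This is a sketch, not a complete decoder: the function decode_utf16 is hypothetical, and falling back to big endian when no BOM is present is an assumption of the example.

```python
import codecs

# The same two bytes decode to different characters depending on byte order.
raw = bytes([0x59, 0x4E])
print(raw.decode("utf-16-be"))   # 奎 (U+594E)
print(raw.decode("utf-16-le"))   # 乙 (U+4E59)

def decode_utf16(data):
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")             # assumption: big endian by default

print(decode_utf16(b"\xfe\xff\x59\x4e"))   # 奎
print(decode_utf16(b"\xff\xfe\x59\x4e"))   # 乙

# UTF-16 puts a 0x00 byte into ASCII text; UTF-8 does not.
print(b"\x00" in "A".encode("utf-16-be"))  # True
print(b"\x00" in "A".encode("utf-8"))      # False
```

Python's own "utf-16" codec also honors a leading BOM when decoding, so the helper is only there to make the rule visible.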

    Big endian and little endian
    Big endian and little endian are the two ways a CPU can order the bytes of a multi-byte number. For example, the Unicode code of the Chinese character "汉" is 6C49. When writing it to a file, does 6C go first or 49? Writing 6C first is big endian; writing 49 first is little endian.
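
The two byte orders can be produced explicitly with the standard struct module. A minimal sketch, using the same code value 6C49 as the example above:

```python
import struct
import sys

code = 0x6C49  # the code value from the example above

big = struct.pack(">H", code)      # high byte first: 6c 49
little = struct.pack("<H", code)   # low byte first:  49 6c

print(big.hex(), little.hex())               # 6c49 496c
print(big == "\u6c49".encode("utf-16-be"))   # True
print(little == "\u6c49".encode("utf-16-le"))  # True
print(sys.byteorder)                         # byte order of the machine running this
```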

    Chinese Encoding
    From GB2312 to GBK to GB18030, more and more Chinese characters are supported. GB2312 and GBK are double-byte character sets (DBCS); GB18030 keeps their one-byte and two-byte codes and adds four-byte codes. The encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible: each one preserves the codes of its predecessors.
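
The compatibility chain can be observed with the codecs that ship with Python (gb2312, gbk, gb18030). This is a sketch; the sample characters are chosen for the illustration, and U+20000 is assumed here simply as an example of a character outside GBK.

```python
GB_ENCODINGS = ("gb2312", "gbk", "gb18030")

# ASCII text keeps the same bytes in every encoding of the chain.
print(all("abc".encode(enc) == b"abc" for enc in GB_ENCODINGS))  # True

# A common Chinese character has the same two-byte code in all three.
han = "\u6c49"  # 汉
codes = {enc: han.encode(enc) for enc in GB_ENCODINGS}
print({enc: value.hex() for enc, value in codes.items()})

# A character outside GBK needs a four-byte GB18030 code.
rare = "\U00020000"
print(len(rare.encode("gb18030")))  # 4
try:
    rare.encode("gbk")
    in_gbk = True
except UnicodeEncodeError:
    in_gbk = False
print(in_gbk)  # False
```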
