Understanding character encoding: Unicode, ISO 10646, UCS, UTF-8, UTF-16, GBK, and GB2312

Last Update:2018-12-03 Source: Internet

Author: User

Original article address: http://www.bobopo.com/article/code/201110/utf8_gb2312_unicode.htm

Unicode

A character encoding scheme developed by unicode.org, intended to cover the scripts in common use all over the world.

Version 1.0 was a 16-bit encoding, from U+0000 to U+FFFF, in which each 2-byte code corresponds to one character. Starting with version 2.0, the 16-bit limit was abandoned. The original 16-bit space became the basic plane (plane 0), and 16 more planes were added. The resulting range is 0 to 0x10FFFF, which is slightly more than a 20-bit encoding (21 bits are needed to hold 0x10FFFF).

UCS

The Universal Character Set defined by the ISO 10646 standard. It uses a 4-byte encoding.

Relationship between Unicode and UCS

ISO and unicode.org are two different organizations, so they initially developed different standards. Since Unicode 2.0, however, Unicode has used the same character repertoire and code points as ISO 10646-1, and ISO has promised that ISO 10646 will not assign characters beyond 0x10FFFF, so that the two stay consistent.

The UCS specifies the following encoding methods:

UCS-2, a 2-byte encoding that is basically the same as the original 2-byte encoding of Unicode.
UCS-4, a 4-byte encoding. For characters that exist in UCS-2, it is currently the UCS-2 code with two all-zero bytes added in front.

UTF: Unicode/UCS Transformation Format

UTF-8, an 8-bit encoding. ASCII characters are left unchanged and all other characters use a variable-length encoding. A BMP character takes 1 to 3 bytes, and a character above U+FFFF takes 4 bytes. It is usually used as an external encoding, for files and data exchange. It has the following advantages: it does not depend on the CPU byte order, so different platforms can exchange it directly, and it tolerates errors well. If one byte is damaged, at most one character is lost and the error does not cascade (with the GB encodings, by contrast, one wrong byte can garble the rest of the line).
UTF-16, a 16-bit variable-length encoding that covers the values 0 to 0x10FFFF (slightly more than 20 bits) and is essentially a direct implementation of the Unicode encoding. It depends on the CPU byte order, so the byte order must be agreed on or marked. Because it is compact for text that consists mostly of BMP characters outside ASCII, it is often used as an external encoding for network transmission. UTF-16 has traditionally been treated as the default encoding form of Unicode.
UTF-32, a 4-byte encoding that uses only the Unicode range (0 to 0x10FFFF), which makes it a subset of UCS-4.
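
As a rough illustration of the three forms above, the following Python sketch compares the encoded size of a few characters and shows how UTF-16 reaches code points above U+FFFF by splitting them into a surrogate pair. The helper names utf16_surrogates and encoded_sizes are made up for this example. The codec names are the ones shipped with Python's standard library.

```python
def utf16_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # 20 bits remain after subtracting
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def encoded_sizes(text: str) -> dict:
    """Byte length of text in each UTF form (big-endian, no byte order mark)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


if __name__ == "__main__":
    for ch in ("A", "\u4e2d", "\U0001F600"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
    print([hex(unit) for unit in utf16_surrogates(0x1F600)])
```

The explicit big-endian codecs are used so that Python does not prepend a byte order mark, which would otherwise distort the size comparison.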

UTF and Unicode

Unicode is a character set and can be viewed as an internal code.

UTF is an encoding method. It exists because raw Unicode values are not suitable for direct transmission and processing in some scenarios. UTF-16 stores the Unicode value of a BMP character directly, without transformation, but the result contains 0x00 bytes: for the first 256 code points the high byte is 0x00, which has a special meaning in the operating system (it terminates a string in C) and can cause problems. Converting Unicode to UTF-8 avoids this problem and brings some further advantages.

Chinese National Standard Code

GB 13000 is equivalent to ISO 10646-1/Unicode 2.1 and will be kept in step with future changes to the ISO 10646/Unicode standards.
GBK is an extension of GB2312. It adds the unified Han ideographs of Unicode 2.1 that lie outside the GB2312 character set, plus some characters not included in Unicode.
GB 18030-2000, based on GB 13000, is an extension of GBK for Unicode 3.0. It covers all Unicode code points, so it plays the same role as UTF-8 and UTF-16 and can be seen as another Unicode encoding form. It is a variable-length encoding in which a character takes 1, 2, or 4 bytes. GB 18030 is backward compatible with GB2312/GBK, and it is mandatory for all non-handheld, non-embedded computer systems in China.
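
The backward compatibility and the 1, 2, or 4 byte lengths can be checked with the GB codecs that ship with Python. This is only a small sketch: the sample characters and the helper name gb_bytes are chosen for illustration, and the codecs reflect Python's own mapping tables rather than any particular edition of the standards.

```python
SAMPLE = "\u4e2d"              # the character 中 (U+4E2D), present in GB2312
SUPPLEMENTARY = "\U00020000"   # a CJK ideograph outside the BMP


def gb_bytes(text: str) -> dict:
    """Encode text with each GB codec available in Python."""
    return {name: text.encode(name) for name in ("gb2312", "gbk", "gb18030")}


if __name__ == "__main__":
    # A GB2312 character keeps the same two bytes in GBK and GB 18030.
    for name, data in gb_bytes(SAMPLE).items():
        print(name, data.hex())

    # ASCII stays single-byte; a character outside the BMP needs four bytes.
    print(len("A".encode("gb18030")), len(SUPPLEMENTARY.encode("gb18030")))
```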

What are UCS and ISO 10646?

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character sets. It guarantees round-trip compatibility with them: if you convert any text string to UCS and then back to the original encoding, no information is lost.

UCS contains the characters needed to represent practically all known languages. It covers not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also the Chinese, Japanese, and Korean ideographs, as well as scripts such as Hiragana, Katakana, Bengali, Gurmukhi, Tamil, Kannada, Malayalam, Thai, Lao, Bopomofo, Hangul, Devanagari, Gujarati, Oriya, Telugu, and many others. Scripts that are not yet covered will be added eventually, because work on how best to encode them for computer use is still going on. These include Tibetan, Khmer, Runic (the ancient Nordic script), Ethiopic, hieroglyphs, and various Indo-European languages, and possibly even selected artistic scripts such as Tengwar, Cirth, and Klingon. UCS also contains a large number of graphical, typographical, mathematical, and scientific symbols, including all those provided by TeX, PostScript, MS-DOS, MS-Windows, Macintosh, and OCR fonts, as well as by many other word processing and publishing systems.

ISO 10646 defines a 31-bit character set. Of this huge code space, however, only the first 65534 positions (0x0000 to 0xFFFD) have been assigned so far. This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP). Characters outside the 16-bit BMP are specialized characters (such as hieroglyphs) that will only be used by experts in history and science. According to current plans, no characters will ever be assigned outside the range 0x000000 to 0x10FFFF, a 21-bit code space that leaves room for more than 1 million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, which covers characters outside the BMP, is being prepared, but it may take several years to complete. New characters are still being added to the BMP, but the existing characters are stable and will not change.
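
The relationship between the BMP, the planes, and the 21-bit code space comes down to simple arithmetic. The sketch below assumes the layout described above, with planes of 65,536 code points each. The names PLANE_SIZE, LAST_CODE_POINT, plane_of, and in_bmp are made up for this example.

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane
LAST_CODE_POINT = 0x10FFFF  # highest code point that is planned to be used


def plane_of(code_point: int) -> int:
    """Return the plane number (0 = BMP) that a code point belongs to."""
    if not 0 <= code_point <= LAST_CODE_POINT:
        raise ValueError("code point outside the range 0..0x10FFFF")
    return code_point >> 16


def in_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


if __name__ == "__main__":
    print((LAST_CODE_POINT + 1) // PLANE_SIZE)   # number of planes
    print(LAST_CODE_POINT.bit_length())          # bits needed for the range
    print(plane_of(0x4E2D), plane_of(0x1F600))
```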

In addition to assigning a code number to each character, UCS also gives each one an official name. A hexadecimal number that represents a UCS or Unicode value is normally prefixed with "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to US-ASCII (ISO 646), and U+0000 to U+00FF are identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF, and also larger ranges outside the BMP, are reserved for private use.
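
These properties can be checked from Python, whose unicodedata module exposes the official character names and general categories. This is a small sketch. The helper name matches_code_points is made up for this example, and the data reflects whichever Unicode version the Python build ships with.

```python
import unicodedata


def matches_code_points(codec: str, count: int) -> bool:
    """True if bytes 0..count-1 decode to the code points with the same numbers."""
    decoded = bytes(range(count)).decode(codec)
    return [ord(ch) for ch in decoded] == list(range(count))


if __name__ == "__main__":
    print(unicodedata.name("\u0041"))          # official name of U+0041
    print(matches_code_points("ascii", 128))   # U+0000..U+007F versus US-ASCII
    print(matches_code_points("latin-1", 256)) # U+0000..U+00FF versus ISO 8859-1
    print(unicodedata.category("\ue000"))      # general category "Co" = private use
```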

What are combining characters?

Some code points in UCS are assigned to combining characters. They are similar to the non-spacing accent keys on a typewriter. A combining character is not a complete character by itself. It is an accent or other diacritical mark that is added to the character before it, so an accent can be placed on any character. The most important accented characters, those used in the orthographies of common languages, have their own positions in UCS to ensure backward compatibility with older character sets. An accented character that has its own code position but can also be represented as a base character followed by a combining character is called a precomposed character. Precomposed characters exist in UCS to maintain backward compatibility with older encodings that have no combining characters, such as ISO 8859. The combining mechanism lets you add accents or other marks to any character, which is particularly useful in scientific notation such as mathematical formulas and the International Phonetic Alphabet, where one or more marks may need to be combined with a base character.

Combining characters follow the character they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can be represented either by the precomposed UCS code U+00C4 or by an ordinary "Latin capital letter A" followed by a "combining diaeresis", that is, U+0041 U+0308. Several combining characters can be applied to one base character when multiple accents need to be stacked or when marks are needed both above and below it. In Thai, for example, a base character can carry up to two combining characters.
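
The two representations of Ä can be converted into each other with Unicode normalization, which Python exposes through unicodedata.normalize. The sketch below uses the standard NFC (composed) and NFD (decomposed) forms. Normalization is defined by the Unicode Standard rather than by the text above, so it is shown here only as one practical way of handling the equivalence.

```python
import unicodedata

PRECOMPOSED = "\u00c4"        # Latin capital letter A with diaeresis
DECOMPOSED = "\u0041\u0308"   # Latin capital letter A + combining diaeresis

if __name__ == "__main__":
    # The two strings look the same when rendered but differ code point by code point.
    print(PRECOMPOSED == DECOMPOSED, len(PRECOMPOSED), len(DECOMPOSED))

    # NFC composes base + combining mark; NFD splits the precomposed form.
    print(unicodedata.normalize("NFC", DECOMPOSED) == PRECOMPOSED)
    print(unicodedata.normalize("NFD", PRECOMPOSED) == DECOMPOSED)

    # A non-zero combining class marks U+0308 as a combining character.
    print(unicodedata.combining("\u0308") != 0)
```

Comparing strings only after normalizing both to the same form avoids treating the two spellings as different text.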

What are UCS implementation levels?

Not all systems need to support all the advanced mechanisms of UCS, such as combining characters. ISO 10646 therefore specifies the following three implementation levels:

Level 1, which supports neither combining characters nor Hangul Jamo characters (a special, more complicated encoding of Korean in which a syllable is coded as two or three sub-characters).
Level 2, which is like level 1, except that for some scripts a fixed list of combining characters is allowed (for example Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, and Lao). Without at least this minimum set of combining characters, UCS cannot fully represent these scripts.
Level 3, which supports all UCS characters. For example, a mathematician can place a tilde (the mark above the Spanish letter ñ), an arrow, or both on any character.
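
A very rough approximation of a level 1 check can be written with unicodedata. This sketch treats every character whose general category starts with "M" as a combining character and uses the Hangul Jamo block U+1100 to U+11FF. It does not reproduce the exact character lists of ISO 10646, and the function name fits_level_1 is made up for this example.

```python
import unicodedata

HANGUL_JAMO = range(0x1100, 0x1200)   # the Hangul Jamo block, U+1100..U+11FF


def fits_level_1(text: str) -> bool:
    """Approximate test: no combining marks and no Hangul Jamo in text."""
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            return False              # combining mark (Mn, Mc, or Me)
        if ord(ch) in HANGUL_JAMO:
            return False              # conjoining Jamo sub-character
    return True


if __name__ == "__main__":
    print(fits_level_1("\u00c4"))         # precomposed character
    print(fits_level_1("A\u0308"))        # base character + combining diaeresis
    print(fits_level_1("\u1112\u1161"))   # syllable spelled with Jamo
```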

What is Unicode?

Historically, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO). The other was the Unicode project, organized by a consortium of (mostly American) manufacturers of multilingual software. Fortunately, around 1991 the participants of both projects realized that the world does not need two different unified character sets. They merged their work and cooperated on a single code table. Both projects still exist and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of Unicode and ISO 10646 compatible and to coordinate any future extensions closely.

So what is the difference between Unicode and ISO 10646?

The Unicode Standard published by the Unicode Consortium contains exactly the Basic Multilingual Plane of ISO 10646-1 at implementation level 3. All characters are at the same positions and have the same names in both standards.

The Unicode Standard additionally defines many more semantics for the characters and is in general the better reference for implementers of high-quality typesetting and publishing systems. Unicode specifies in detail the algorithms for rendering the presentation forms of some scripts (such as Arabic), for handling bidirectional text (such as mixed Latin and Hebrew), for sorting and comparing strings, and much more.

The ISO 10646 standard, on the other hand, is not much more than a simple character set table, like the well-known ISO 8859 standard. It specifies some terminology related to the standard, defines some encoding alternatives, and states how to use UCS together with other established ISO standards such as ISO 6429 and ISO 2022. There are also other closely related ISO standards, for example ISO 14651 on the sorting of UCS strings.

Considering that the Unicode Standard has an easy-to-remember name, is published by Addison-Wesley and available in any good bookstore, costs only a fraction of the ISO version, and includes much more supporting material, it is not surprising that it has become the more widely used reference. However, it is generally believed that the fonts used to print ISO 10646-1 are in some respects of higher quality than those used to print Unicode 2.0. Professional font designers are advised to work from both standards, although some of the sample glyphs differ significantly. The ISO 10646-1 standard also shows four different style variants for the ideographs used in Chinese, Japanese, and Korean (CJK), while the Unicode 2.0 tables show only the Chinese variants. This has led to the widespread belief that Unicode is unacceptable to Japanese users, even though that belief is wrong.

What is UTF-8?

First of all, a character set table only assigns an integer to each character. There are several ways to represent a sequence of characters as a sequence of bytes. The two most obvious ones store Unicode text as a sequence of 2-byte or 4-byte units. The official names of these two methods are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first (big-endian convention). Converting an ASCII or Latin-1 file to UCS-2 simply means inserting a 0x00 byte before each ASCII byte. To convert to UCS-4, three 0x00 bytes must be inserted before each ASCII byte.
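
The zero-byte insertion described above can be written out directly. The sketch below assumes big-endian byte order and no byte order mark. The function names latin1_to_ucs2 and latin1_to_ucs4 are made up for this example, and Python's utf-16-be and utf-32-be codecs are used only as a cross-check, since they produce the same bytes for Latin-1 text.

```python
def latin1_to_ucs2(data: bytes) -> bytes:
    """Convert ASCII/Latin-1 bytes to big-endian UCS-2 by prefixing 0x00."""
    out = bytearray()
    for byte in data:
        out += bytes([0x00, byte])
    return bytes(out)


def latin1_to_ucs4(data: bytes) -> bytes:
    """Convert ASCII/Latin-1 bytes to big-endian UCS-4 by prefixing three 0x00."""
    out = bytearray()
    for byte in data:
        out += bytes([0x00, 0x00, 0x00, byte])
    return bytes(out)


if __name__ == "__main__":
    text = "Hi"
    raw = text.encode("latin-1")
    print(latin1_to_ucs2(raw).hex(" "))
    print(latin1_to_ucs4(raw).hex(" "))
    print(latin1_to_ucs2(raw) == text.encode("utf-16-be"))
```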

Using UCS-2 (or UCS-4) under UNIX would cause very serious problems. Strings in these encodings can contain, as part of many wide characters, bytes such as '\0' or '/', which have a special meaning in file names and in other C library function parameters. In addition, most UNIX tools that expect ASCII files cannot read 16-bit characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode for file names, text files, environment variables, and so on.

The UTF-8 encoding defined in ISO 10646-1 Annex R and in RFC 2279 does not have these problems. It is the obvious way to use Unicode under UNIX-style operating systems.
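
The difference can be seen on a short string. The sketch below looks for the two bytes that are special in UNIX file names, 0x00 and 0x2F ('/'), in the UCS-2 and UTF-8 forms of the same text. The character U+012F is picked only because its UCS-2 form happens to contain the byte 0x2F, and the helper name problem_bytes is made up for this example.

```python
def problem_bytes(encoded: bytes) -> list:
    """Return the NUL (0x00) and '/' (0x2F) bytes found in an encoded string."""
    return [byte for byte in encoded if byte in (0x00, 0x2F)]


if __name__ == "__main__":
    text = "a\u012f"                    # 'a' followed by U+012F
    ucs2 = text.encode("utf-16-be")     # bytes 00 61 01 2F
    utf8 = text.encode("utf-8")         # bytes 61 C4 AF

    print(ucs2.hex(" "), problem_bytes(ucs2))
    print(utf8.hex(" "), problem_bytes(utf8))
```

In UTF-8 the bytes 0x00 and 0x2F appear only when the text itself contains U+0000 or '/', so existing C string and path handling keeps working.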

Characteristics of UTF-8

UCS characters U+0000 to U+007F (ASCII) are encoded simply as the bytes 0x00 to 0x7F (ASCII compatibility). This means that files containing only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, an ASCII byte (0x00-0x7F) can never appear as part of any other character.
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and indicates how many bytes the character occupies. All further bytes of a multibyte sequence are in the range 0x80 to 0xBF. This makes resynchronization easy and makes the encoding stateless and robust against missing bytes.
All possible 2^31 UCS codes can be encoded.
In theory, UTF-8 encoded characters can be up to 6 bytes long, but 16-bit BMP characters are at most 3 bytes long.
The sorting order of big-endian UCS-4 byte strings is preserved.
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
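
Two of these properties, resynchronization and preserved sorting order, are easy to demonstrate. The sketch below relies only on the rule that continuation bytes lie in the range 0x80 to 0xBF. The function names is_continuation and next_char_start are made up for this example.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx (0x80..0xBF)."""
    return byte & 0xC0 == 0x80


def next_char_start(data: bytes, pos: int) -> int:
    """Index of the first character boundary at or after pos."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


if __name__ == "__main__":
    data = "a\u00e9\u4e2d".encode("utf-8")   # 61 | C3 A9 | E4 B8 AD

    # Starting in the middle of a character, skip ahead to the next boundary.
    start = next_char_start(data, 2)
    print(start, data[start:].decode("utf-8"))

    # Sorting by UTF-8 bytes gives the same order as sorting by code point.
    words = ["z", "\u00e9", "a", "\u4e2d", "\U0001F600", "\uffff"]
    print(sorted(words) == sorted(words, key=lambda w: w.encode("utf-8")))
```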

The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character.

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character's code number in binary representation. The rightmost x bit is the least significant bit. Only the shortest possible multibyte sequence that can represent the code number of the character may be used. Note that in a multibyte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the entire sequence.

For example, the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as:

11000010 10101001 = 0xC2 0xA9

The character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
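
The table and the two worked examples can be turned into a small encoder. The following Python sketch follows the original 31-bit definition given above, including the 5-byte and 6-byte forms. Python's own utf-8 codec stops at U+10FFFF and rejects surrogates, so it can serve as a cross-check only within that range. The function name utf8_encode is made up for this example.

```python
# (largest code number, sequence length, marker bits of the first byte)
UTF8_FORMS = [
    (0x7FF, 2, 0xC0),
    (0xFFFF, 3, 0xE0),
    (0x1FFFFF, 4, 0xF0),
    (0x3FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def utf8_encode(code: int) -> bytes:
    """Encode one UCS code number using the shortest form from the table."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number outside the 31-bit UCS range")
    if code <= 0x7F:
        return bytes([code])                  # 0xxxxxxx, plain ASCII
    for limit, length, marker in UTF8_FORMS:
        if code <= limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))     # 10xxxxxx, lowest 6 bits first
        code >>= 6
    return bytes([marker | code] + tail[::-1])


if __name__ == "__main__":
    print(utf8_encode(0x00A9).hex(" "))   # copyright sign
    print(utf8_encode(0x2260).hex(" "))   # not equal to
```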

The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Do not write UTF-8 in any other way (such as utf8 or UTF_8) in documentation, unless you are referring to a variable name rather than to the encoding itself.

What programming languages support Unicode?

Most modern programming languages developed around 1993 or later have a special data type for Unicode/ISO 10646-1 characters. It is called Wide_Character in Ada95 and char in Java.

ISO C also specifies in detail the mechanisms for handling multibyte encodings and wide characters, and Amendment 1 to ISO C, published in September 1994, added more. These mechanisms were designed mainly with the various East Asian encodings in mind and are considerably more powerful than what is needed to handle UCS. UTF-8 is an example of what the ISO C standard calls a multibyte encoding, and the wchar_t type can be used to store Unicode characters.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
