What is Unicode? What is UTF-8?

Last Update:2017-02-27 Source: Internet

Author: User

Tags character set


My knowledge of character encoding has never been very strong, so I knew little about Unicode and UTF-8.

I recently came across an article on UTF-8 and found its explanation very complicated, so I decided to write a simple, easy-to-understand one.

Let's begin with some of the encoding schemes in common use today:
1, In mainland China the most commonly used encoding is GB18030, alongside GBK and GB2312. The three are related as follows.

The earliest encoding was GB2312, which includes 6,763 Chinese characters and 682 other symbols.

In 1995 the encoding was revised and named GBK 1.0, covering 21,886 symbols in total.

Later came GB18030, which covers 27,484 Chinese characters and also includes Tibetan, Mongolian, Uyghur and other major minority scripts. The Windows platform is now required to support GB18030.

In the order GB18030, GBK, GB2312, the three encodings are backward compatible: a character that exists in more than one of them has the same code in each.
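The compatibility claim can be checked with a small Python sketch. It assumes Python's standard codec names `gb2312`, `gbk` and `gb18030`; the helper `try_encode` and the sample characters are chosen here purely for illustration.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec has no code for the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

CODECS = ("gb2312", "gbk", "gb18030")

# "文" exists in all three standards; "龍" is a traditional form that was
# only added with GBK, so the GB2312 codec is expected to reject it.
common = {codec: try_encode("文", codec) for codec in CODECS}
later = {codec: try_encode("龍", codec) for codec in CODECS}

for codec in CODECS:
    c, l = common[codec], later[codec]
    print(codec, c.hex() if c else None, l.hex() if l else None)
```

Wherever a codec can encode the character at all, the bytes it produces match those of the other two.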

2, Taiwan, Hong Kong and some other regions use Big5.

3, Japan: Shift_JIS (SJIS).

If we think of the various text encodings as regional dialects, then Unicode is a common language developed jointly by countries all over the world.

In this language there are no more encoding conflicts, and content in any language can be displayed on the same screen. This is the greatest advantage of Unicode.

So how is Unicode encoded? The original idea is actually very simple.

It is to encode all the text in the world in 2 bytes. You might ask: 2 bytes can represent at most 65,536 codes, so is that enough?

Most of the Chinese characters used in Korea and Japan came from China, and their forms are essentially the same.

For example, the character "文" (wen) is the same character in GBK and SJIS, but its code differs between the two.

Once such shared characters are given a single unified code, 2 bytes is enough to accommodate most of the world's languages.
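As a minimal sketch of this point, the following Python snippet encodes the same character with the `gbk` and `shift_jis` codecs from Python's standard library and prints its single Unicode code point. The variable names are illustrative only.

```python
ch = "文"

gbk_bytes = ch.encode("gbk")         # code used in mainland China
sjis_bytes = ch.encode("shift_jis")  # code used in Japan

print("GBK:      ", gbk_bytes.hex())
print("Shift_JIS:", sjis_bytes.hex())
print("Unicode:   U+%04X" % ord(ch))

# Both byte sequences decode back to the same Unicode character.
same_character = gbk_bytes.decode("gbk") == sjis_bytes.decode("shift_jis")
print("Same character after decoding:", same_character)
```

The two legacy encodings disagree on the bytes, while Unicode assigns the character one code point that both can be mapped to.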

The formal name of the character set, as used in the ISO/IEC 10646 standard that Unicode is aligned with, is "Universal Multiple-Octet Coded Character Set", abbreviated UCS.

The form in common use is UCS-2, a 2-byte encoding; UCS-4 was developed in case 2 bytes turn out to be insufficient. The range covered by UCS-2 is known as the Basic Multilingual Plane (BMP).

Converting UCS-2 to UCS-4 simply means prepending two zero bytes to each 2-byte code.
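The padding rule can be sketched in Python as follows. Python has no codec named UCS-2, so this sketch assumes `utf-16-be` as a stand-in (the two agree for characters in the basic plane) and `utf-32-be` as the UCS-4 form; `ucs2_to_ucs4` is a hypothetical helper written for this example.

```python
def ucs2_to_ucs4(ucs2_be: bytes) -> bytes:
    """Prefix every 2-byte big-endian unit with two zero bytes."""
    if len(ucs2_be) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    units = (ucs2_be[i:i + 2] for i in range(0, len(ucs2_be), 2))
    return b"".join(b"\x00\x00" + unit for unit in units)

text = "Unicode 文"                 # basic-plane characters only
ucs2 = text.encode("utf-16-be")     # 2 bytes per character
ucs4 = ucs2_to_ucs4(ucs2)           # 4 bytes per character

print(ucs2.hex())
print(ucs4.hex())
print(ucs4 == text.encode("utf-32-be"))
```

The shortcut only holds for basic-plane text; characters outside it cannot be represented in UCS-2 at all.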

UCS-4 is mainly used to hold the supplementary planes, for example the second supplementary plane in Unicode 4.0:

20000-20FFF, 21000-21FFF, 22000-22FFF, 23000-23FFF, 24000-24FFF, 25000-25FFF, 26000-26FFF, 27000-27FFF, 28000-28FFF, 29000-29FFF, 2A000-2AFFF, 2F000-2FFFF

A total of 16 supplementary planes were added, extending the original 65,536 codes to just over 1.1 million (17 planes of 65,536 codes each, or 1,114,112).
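The plane arithmetic can be illustrated with a short Python sketch. The names `PLANE_SIZE`, `TOTAL_CODE_POINTS` and `plane_of` are made up for this example; `sys.maxunicode` is the largest code point Python itself supports.

```python
import sys

PLANE_SIZE = 0x10000                  # 65,536 code points per plane
TOTAL_CODE_POINTS = 17 * PLANE_SIZE   # basic plane + 16 supplementary planes

def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    return code_point >> 16           # same as dividing by 65,536

print(TOTAL_CODE_POINTS)              # total size of the code space
print(plane_of(ord("文")))            # a basic-plane character
print(plane_of(0x20000))              # first code of the second supplementary plane
print(sys.maxunicode == TOTAL_CODE_POINTS - 1)
```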

With a unified encoding in place, how do we stay compatible with each country's existing text encodings?

This is where code pages come in.

What is a code page? A code page is the mapping table between a country's text encoding and Unicode.

For example, the mapping table between Simplified Chinese and Unicode is CP936, for which an official mapping table is published.

Here are a few commonly used code pages; the table for each one is found by changing the number in the address of the CP936 table accordingly.

codepage=936 Simplified Chinese GBK

codepage=950 Traditional Chinese BIG5

codepage=437 American/Canadian English

codepage=932 Japanese

codepage=949 Korean

codepage=866 Russian

codepage=65001 Unicode UTF-8
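To see why the code page matters, the sketch below decodes the same two bytes under several of the code pages listed above. It assumes Python's codec names `cp936`, `cp950`, `cp932`, `cp949`, `cp437` and `cp866`; bytes that a code page cannot map are replaced rather than raising an error.

```python
raw = "文".encode("cp936")   # the bytes a GBK system would store

decoded = {}
for number in (936, 950, 932, 949, 437, 866):
    # errors="replace" substitutes U+FFFD for bytes the code page cannot map
    decoded[number] = raw.decode(f"cp{number}", errors="replace")
    print(number, repr(decoded[number]))
```

Only the code page that matches the original encoding recovers the intended character; the others turn the same bytes into unrelated text.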

The last entry in the list, 65001, is as far as I understand only a virtual mapping table: it is really just an algorithm.
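To illustrate what "just an algorithm" means, here is a minimal sketch of the UTF-8 encoding rule in Python: the bits of a code point are spread over one to four bytes according to its size, with no lookup table involved. The function `utf8_encode` is written for this example and, unlike Python's built-in codec, does not reject surrogate code points.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 by bit manipulation."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point outside the Unicode range")

for ch in ("A", "é", "文", chr(0x20000)):
    manual = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", manual.hex(), manual == ch.encode("utf-8"))
```

Each result is compared against Python's own `utf-8` codec as a sanity check.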

Take one line from the 936 table as an example:

0x9993 0x6ABD #CJK Unified Ideograph

The first value is the GBK code and the second is the corresponding Unicode code.
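A line in this format can be turned into a lookup entry with a few lines of Python. This is only a sketch: the function `parse_mapping_line` and the dictionary `table` are hypothetical names, and the sample line is the one quoted above rather than a verified excerpt of the real table.

```python
def parse_mapping_line(line: str) -> tuple[int, int]:
    """Split 'native-code unicode-code #comment' into two integers."""
    fields = line.split("#", 1)[0].split()   # drop the trailing comment
    if len(fields) != 2:
        raise ValueError(f"unexpected mapping line: {line!r}")
    native, unicode_cp = (int(field, 16) for field in fields)
    return native, unicode_cp

line = "0x9993 0x6ABD #CJK Unified Ideograph"
native, unicode_cp = parse_mapping_line(line)

# Build a one-entry decoding table: native bytes -> Unicode character.
table = {native.to_bytes(2, "big"): chr(unicode_cp)}

print(f"GBK 0x{native:04X} -> U+{unicode_cp:04X}")
print(table)
```

A full converter would read every line of the table into such a dictionary and look up each code found in the input.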
