What is Unicode? What is UTF-8?

Source: Internet
Author: User
Tags: character set

My knowledge of character encoding has never been very deep, so I never really understood Unicode and UTF-8.

Recently I came across an article on UTF-8 and found its explanation very complicated, so I decided to write a simple, easy-to-understand one.

Let's begin with some of the encoding schemes in common use today:
1. In China, the encoding most commonly used on the mainland is GB18030, alongside GBK and GB2312. The relationship between these encodings is as follows.

The earliest encoding was GB2312, which includes 6,763 Chinese characters and 682 other symbols.

In 1995 the encoding was extended and named GBK 1.0, with a total of 21,886 symbols.

Later came GB18030, which includes a total of 27,484 Chinese characters and also covers Tibetan, Mongolian, Uyghur and other major minority scripts. The Windows platform is now required to support GB18030.

GB18030, GBK and GB2312, in that order, are backward compatible: a character that exists in more than one of them has the same code in each.
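The compatibility claim can be checked with a short Python sketch. It assumes Python's built-in codecs named gb2312, gbk and gb18030 correspond to the three standards; the helper name encode_all is made up for this illustration.

```python
# Encode the same text with the three codecs and compare the bytes.
CODECS = ("gb2312", "gbk", "gb18030")

def encode_all(text):
    """Return a dict mapping each codec name to the encoded bytes."""
    return {name: text.encode(name) for name in CODECS}

result = encode_all("中文")
for name, raw in result.items():
    print(name, raw.hex())

# A single distinct value means all three codecs agree on these characters.
print("identical:", len(set(result.values())) == 1)
```

This only holds for characters that already exist in the older standard; a character added in GBK or GB18030 cannot be encoded with the gb2312 codec at all.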

2. Taiwan, Hong Kong and other regions use Big5.

3. Japan uses Shift_JIS (SJIS).

If we think of all these text encodings as regional dialects, then Unicode is a common language developed jointly by the countries of the world.

In this common language there are no more encoding conflicts: content in any language can be displayed on the same screen. This is Unicode's greatest advantage.

So how does Unicode encode text? The basic idea is actually very simple.

It is to encode all the text in the world in 2 bytes. You might ask: 2 bytes can represent at most 65,536 codes, so is that enough?

Most of the Han characters used in Korea and Japan came from China, and their shapes are the same.

For example, the character "文" (wen) is the same character in GBK and in SJIS, but its code differs between the two.

By unifying such characters under a single code, 2 bytes are enough to hold most of the world's scripts.
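A small Python sketch of the "文" example: the legacy encodings give it different bytes, while Unicode gives it one code point. The codec names gbk and shift_jis are Python's built-in ones; the variable names are only for illustration.

```python
char = "文"

# The same character has different byte sequences in the two legacy encodings.
gbk_bytes = char.encode("gbk")
sjis_bytes = char.encode("shift_jis")
print("GBK :", gbk_bytes.hex())
print("SJIS:", sjis_bytes.hex())

# Decoding either sequence yields the same single Unicode code point.
from_gbk = gbk_bytes.decode("gbk")
from_sjis = sjis_bytes.decode("shift_jis")
print("same character:", from_gbk == from_sjis)
print("code point: U+%04X" % ord(char))
```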

The formal name of the corresponding ISO/IEC 10646 standard is "Universal Multiple-Octet Coded Character Set", abbreviated UCS.

What we use now is UCS-2, a 2-byte encoding; UCS-4 was developed in case 2 bytes turn out not to be enough in the future. The range covered by UCS-2 is also known as the Basic Multilingual Plane (BMP).

Converting UCS-2 to UCS-4 simply means prepending two zero bytes to each code.
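The widening step can be sketched in Python as follows. Python has no codec called "ucs-2", so this sketch assumes big-endian UTF-16 as a stand-in (the two coincide for characters in the 2-byte range) and big-endian UTF-32 as a stand-in for UCS-4. The function name ucs2_to_ucs4 is hypothetical.

```python
def ucs2_to_ucs4(ucs2):
    """Widen big-endian 2-byte codes to 4 bytes by prepending two zero bytes."""
    if len(ucs2) % 2 != 0:
        raise ValueError("UCS-2 data must have an even number of bytes")
    out = bytearray()
    for i in range(0, len(ucs2), 2):
        out += b"\x00\x00" + ucs2[i:i + 2]
    return bytes(out)

text = "文A"
ucs2 = text.encode("utf-16-be")   # same bytes as UCS-2 for these characters
ucs4 = ucs2_to_ucs4(ucs2)
print(ucs2.hex())
print(ucs4.hex())
print("matches utf-32-be:", ucs4 == text.encode("utf-32-be"))
```

The byte order matters: in a little-endian layout the two zero bytes would go after each code rather than before it.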

UCS-4 is mainly needed for the auxiliary (supplementary) planes, for example the second auxiliary plane in Unicode 4.0, which covers these ranges:

20000-20FFF, 21000-21FFF, 22000-22FFF, 23000-23FFF, 24000-24FFF, 25000-25FFF, 26000-26FFF, 27000-27FFF, 28000-28FFF, 29000-29FFF, 2A000-2AFFF, 2F000-2FFFF

In total, 16 auxiliary planes were added, extending the original 65,536 code positions to more than 1 million (17 planes x 65,536 = 1,114,112).
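The plane arithmetic is easy to check in Python. In this sketch the constants and the helper plane_of are names chosen for illustration; the plane number is just the code point divided by 65,536.

```python
PLANE_SIZE = 0x10000   # 65,536 code positions per plane
PLANE_COUNT = 17       # the basic plane plus 16 auxiliary planes

def plane_of(code_point):
    """Return the plane number (0 = Basic Multilingual Plane) of a code point."""
    if not 0 <= code_point < PLANE_SIZE * PLANE_COUNT:
        raise ValueError("code point out of range")
    return code_point >> 16

print("total code positions:", PLANE_SIZE * PLANE_COUNT)
print("plane of U+6587 :", plane_of(0x6587))
print("plane of U+20000:", plane_of(0x20000))
```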

Now that there is a unified encoding, how does it stay compatible with the existing national text encodings?

This is where code pages come in.

What is a code page? A code page is the mapping table between a country's text encoding and Unicode.

For example, the mapping table between Simplified Chinese and Unicode is CP936, for which an official mapping table is published.

Here are a few commonly used code pages; to find the table for another one, change the number in the table's address accordingly.

codepage=936 Simplified Chinese GBK

codepage=950 Traditional Chinese BIG5

codepage=437 American/Canadian English

codepage=932 Japanese

codepage=949 Korean

codepage=866 Russian

codepage=65001 Unicode UTF-8

The last one, 65001, is as far as I understand only a virtual mapping table; it is really just an algorithm.

Take a line from the 936 table, for example:
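To illustrate why UTF-8 needs no lookup table, here is a minimal Python sketch of the bit-packing rule that turns a code point into 1 to 4 bytes. The function name utf8_encode is made up for this example; it skips validation (for instance it does not reject surrogate code points), so it is not a replacement for the built-in codec.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 using only bit operations."""
    if cp < 0x80:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0xE9, 0x6587, 0x20000):
    print("U+%04X ->" % cp, utf8_encode(cp).hex())
```

Comparing the result with Python's own chr(cp).encode("utf-8") is a quick way to check the sketch for any code point of interest.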

0x9993 0x6ABD # CJK Unified Ideograph

The first value is the GBK code; the second is the corresponding Unicode code point.
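A lookup in this table can be imitated in Python, assuming the built-in cp936 codec reflects the same mapping table. The helper gbk_to_unicode is a hypothetical name; it takes a 2-byte GBK code as an integer and returns the Unicode code point.

```python
def gbk_to_unicode(gbk_code):
    """Map a 2-byte GBK code (as an integer) to its Unicode code point."""
    raw = gbk_code.to_bytes(2, "big")
    return ord(raw.decode("cp936"))

# The table line quoted above pairs 0x9993 with 0x6ABD; print what the codec says.
print("0x9993 -> 0x%04X" % gbk_to_unicode(0x9993))

# The character "文" from the earlier example.
print("0xCEC4 -> 0x%04X" % gbk_to_unicode(0xCEC4))
```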
