What is Unicode, and what is UTF-8

Source: Internet
Author: User
Tags: control characters
My knowledge of character encoding has never been very deep, so I did not know much about Unicode and UTF-8.
Recently I came across an article on UTF-8 and found its explanation very complicated, so I decided to write a simple, easy-to-follow one myself.


Let's begin with some of the encoding schemes in common use today:
1. In mainland China, the most commonly used encoding is GB18030, alongside GBK and GB2312. The relationship between these encodings is as follows.
The earliest was GB2312, which contains 6,763 Chinese characters and 682 other symbols.
In 1995 it was extended and named GBK 1.0, covering 21,886 symbols in total.
Later came GB18030, which contains 27,484 Chinese characters and also covers the scripts of major minority languages such as Tibetan, Mongolian and Uyghur. The Windows platform is now required to support GB18030.

In the order GB18030, GBK, GB2312, each encoding is backward compatible with the next: a character that exists in more than one of the three has the same code in all of them.
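This compatibility can be checked with a small Python sketch. It assumes that Python's built-in `gb2312`, `gbk` and `gb18030` codecs implement these standards faithfully; the sample characters are my own choice for illustration and do not come from the original article.

```python
# Characters that exist in GB2312 keep the same bytes in GBK and GB18030.
common = "文字"  # illustrative sample; both characters are in GB2312
encoded = {name: common.encode(name) for name in ("gb2312", "gbk", "gb18030")}
for name, data in encoded.items():
    print(name, data.hex())

# A character that GBK added keeps its GBK bytes in GB18030.
gbk_only = "镕"  # assumption: this character is outside GB2312
try:
    gbk_only.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
print("encodable in GB2312:", in_gb2312)

gbk_bytes = gbk_only.encode("gbk")
gb18030_bytes = gbk_only.encode("gb18030")
print(gbk_bytes.hex(), gb18030_bytes.hex())
```

The three hex strings printed for the common text should be identical, which is what "backward compatible" means here.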

2. Taiwan, Hong Kong and some other regions use Big5.
3. Japan uses Shift_JIS (SJIS).

If we think of the various national text encodings as regional dialects, then Unicode is a common language developed jointly by all the countries of the world.
In this common language there are no more conflicts between encodings: content in any language can be displayed on the same screen. This is the greatest advantage of Unicode.

So how does Unicode encode text? The original idea is actually very simple:
encode all the text in the world in 2 bytes. You might ask: 2 bytes can represent at most 65,536 codes, is that enough?
Most of the Chinese characters used in Korea and Japan came from China, and their shapes are exactly the same.
For example, the character 文 ("wen") exists in both GBK and SJIS, but its code is different in each.
Once such shared characters are given a single unified code, 2 bytes are enough to hold most of the world's writing systems.
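The point about one character having different legacy codes but a single unified code can be illustrated with a short Python sketch. It relies on Python's built-in `gbk` and `shift_jis` codecs, which I assume match the encodings the article has in mind.

```python
ch = "文"

# The same character has a different byte sequence in each legacy encoding.
gbk_code = ch.encode("gbk")
sjis_code = ch.encode("shift_jis")

# Unicode assigns the character exactly one code point.
print(f"GBK {gbk_code.hex()}  SJIS {sjis_code.hex()}  Unicode U+{ord(ch):04X}")

# Decoding either legacy form leads back to the same Unicode character.
same = gbk_code.decode("gbk") == sjis_code.decode("shift_jis")
print("same character after decoding:", same)
```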

The formal name of the character set behind Unicode is "Universal Multiple-Octet Coded Character Set" (ISO/IEC 10646), abbreviated UCS.
What is commonly used today is UCS-2, a 2-byte encoding; UCS-4 was developed in case 2 bytes turn out not to be enough. The range covered by UCS-2 is also known as the Basic Multilingual Plane (BMP).
Converting UCS-2 to UCS-4 simply means prefixing two zero bytes.
UCS-4 is mainly used to hold the supplementary planes, for example plane 2 (the Supplementary Ideographic Plane) in Unicode 4.0:
20000-20FFF 21000-21FFF 22000-22FFF 23000-23FFF 24000-24FFF 25000-25FFF 26000-26FFF 27000-27FFF 28000-28FFF 29000-29FFF 2A000-2AFFF 2F000-2FFFF
In total 16 supplementary planes were added, extending the code space from the original 65,536 code points to more than 1 million (17 × 65,536 = 1,114,112).
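A minimal Python sketch of the zero-prefix rule and the plane arithmetic follows. Python has no codec named UCS-2 or UCS-4, so the sketch uses `utf-16-be` and `utf-32-be` as stand-ins; for a character in the basic plane these produce the same bytes as big-endian UCS-2 and UCS-4. The sample character is my own choice.

```python
ch = "你"  # a basic-plane character, U+4F60

ucs2 = ch.encode("utf-16-be")  # stand-in for big-endian UCS-2
ucs4 = ch.encode("utf-32-be")  # stand-in for big-endian UCS-4
print(ucs2.hex(), ucs4.hex())

# UCS-4 is the UCS-2 value with two zero bytes in front.
padded = b"\x00\x00" + ucs2
print("zero-prefixed UCS-2 equals UCS-4:", padded == ucs4)

# One basic plane plus 16 supplementary planes, 65,536 code points each.
total_code_points = (1 + 16) * 0x10000
print(total_code_points)
```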

Now that there is a unified encoding, how does it stay compatible with each country's existing text encodings?
This is where the code page comes in.
What is a code page? A code page is the mapping table between a country's text encoding and Unicode.
For example, the mapping table between Simplified Chinese and Unicode is CP936, for which an official mapping table is published.

Here are a few commonly used code pages; each mapping table is identified by its number.
codepage=936 Simplified Chinese (GBK)
codepage=950 Traditional Chinese (Big5)
codepage=437 US/Canadian English
codepage=932 Japanese
codepage=949 Korean
codepage=866 Russian
codepage=65001 Unicode UTF-8

As far as I understand, the last one, 65001, is only a virtual mapping table: in reality it is just an algorithm.

Take one line from the 936 table as an example:
0x9993 0x6ABD #CJK Unified Ideograph
The first value is the GBK code and the second is the Unicode code.
By looking codes up in this table, conversion between GBK and Unicode is easy to implement.
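The table lookup described above might be sketched in Python as follows. The function name `load_table` and the two-line sample table are hypothetical: the first line is the one quoted in the article, the second (for 你) is added by me for illustration. A real converter would load the complete CP936 mapping file instead.

```python
# Hypothetical two-line excerpt in the "native code, Unicode code, comment" layout.
SAMPLE_TABLE = """\
0x9993 0x6ABD #CJK Unified Ideograph
0xC4E3 0x4F60 #CJK Unified Ideograph
"""

def load_table(text):
    """Parse mapping lines into a {native_code: unicode_code} dictionary."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the comment, split columns
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        native, unicode_code = (int(field, 16) for field in fields)
        mapping[native] = unicode_code
    return mapping

gbk_to_unicode = load_table(SAMPLE_TABLE)
# Reverse the dictionary to convert in the other direction.
unicode_to_gbk = {u: g for g, u in gbk_to_unicode.items()}

print(hex(gbk_to_unicode[0x9993]))
print(hex(unicode_to_gbk[0x4F60]))
```

Two dictionaries are used so that each direction is a single lookup.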


Now that we understand Unicode, what is UTF-8, and why does it exist?

Converting ASCII to UCS-2 just means inserting a 0x00 byte before each character code. Text encoded this way therefore contains bytes such as '\0' or '/', which have special meaning in UNIX and in some C functions and can cause serious errors. For this reason UCS-2 is not suitable as an external encoding of Unicode.
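A short Python sketch makes the problem visible. It uses `utf-16-be` as a stand-in for big-endian UCS-2 (the bytes are the same for basic-plane characters), and the code point U+4E2F is picked by me only because its second byte happens to be 0x2F, the ASCII code of '/'.

```python
# ASCII text in UCS-2: every character gains a 0x00 byte, which C string
# functions treat as the end of the string.
ascii_ucs2 = "AB".encode("utf-16-be")
print(ascii_ucs2)

# A non-ASCII character whose UCS-2 bytes contain 0x2F, the path separator '/'.
ch = chr(0x4E2F)
ucs2 = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

has_slash_ucs2 = b"/" in ucs2
has_slash_utf8 = b"/" in utf8
print("UCS-2:", ucs2.hex(), "contains '/':", has_slash_ucs2)
print("UTF-8:", utf8.hex(), "contains '/':", has_slash_utf8)
```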

That is why UTF-8 was created. So how does UTF-8 encode text, and how does it solve the problem of UCS-2?

Example:
E4 BD A0 11100100 10111101 10100000
This is the UTF-8 encoding of the character 你 ("you").
4F 60 01001111 01100000
This is the Unicode (UCS-2) code of 你.

According to the UTF-8 encoding rules, the three bytes decompose as follows: xxxx0100 xx111101 xx100000
Concatenating the bits that are not x gives the Unicode code of 你: 0100 111101 100000 = 01001111 01100000.
Note the three leading 1 bits of the first UTF-8 byte: they indicate that the whole UTF-8 sequence consists of 3 bytes.
After UTF-8 encoding, the sensitive bytes no longer appear inside multi-byte characters, because the highest bit of every byte in such a sequence is always 1.
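The decomposition above can be reproduced with a few bit operations. This is a minimal Python sketch that handles only the 3-byte case from the example; the variable names are my own.

```python
data = bytes.fromhex("E4BDA0")  # the UTF-8 bytes of 你 from the example
lead, cont1, cont2 = data

# Check the marker bits: 1110xxxx for the first byte, 10xxxxxx for the rest.
assert lead >> 4 == 0b1110
assert cont1 >> 6 == 0b10 and cont2 >> 6 == 0b10

# Mask off the marker bits and splice the remaining 4 + 6 + 6 payload bits.
code_point = ((lead & 0x0F) << 12) | ((cont1 & 0x3F) << 6) | (cont2 & 0x3F)
print(f"U+{code_point:04X}", chr(code_point))
```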

The following table shows the conversion between Unicode and UTF-8:
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The current UTF-8 standard (RFC 3629) stops at U+10FFFF, so in practice only the first four rows are used.

To convert a Unicode code to UTF-8, pick the row that matches its range and fill the bits of the code, in order, into the x positions; the result is the UTF-8 encoding.
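The bit-filling rule can be written as a short Python function. This is a sketch of the table as given here, including the 5- and 6-byte rows of the original UTF-8 definition; the function name `utf8_encode` is my own, and real code would normally just call `str.encode("utf-8")`.

```python
# (upper limit of the range, marker bits of the first byte, continuation bytes)
ROWS = [
    (0x7FF, 0xC0, 1),
    (0xFFFF, 0xE0, 2),
    (0x1FFFFF, 0xF0, 3),
    (0x3FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
]

def utf8_encode(code_point):
    """Encode one code point by filling its bits into the x positions."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return bytes([code_point])  # single byte: 0xxxxxxx
    for limit, marker, n_cont in ROWS:
        if code_point <= limit:
            # First byte: marker bits plus the highest payload bits.
            out = [marker | (code_point >> (6 * n_cont))]
            # Continuation bytes: 10xxxxxx, six payload bits each.
            for i in range(n_cont - 1, -1, -1):
                out.append(0x80 | ((code_point >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("code point is outside the table")

print(utf8_encode(0x4F60).hex())
```

For code points up to U+10FFFF (excluding surrogates) the result can be compared against Python's built-in encoder, for example `utf8_encode(0x4F60) == "你".encode("utf-8")`.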
