My grasp of character encodings has never been very good, so I knew little about Unicode and UTF-8.
I recently came across an article on UTF-8 and found its explanation overly complicated, so I decided to write a simple, easy-to-follow one myself.
Let's begin with some of the encoding schemes in common use today:
1. In mainland China, the most commonly used encoding is GB18030; GBK and GB2312 are also in use. These encodings are related as follows.
The earliest was GB2312, which includes 6,763 Chinese characters and 682 other symbols.
It was revised in 1995 as GBK 1.0, which includes 21,886 characters and symbols in total.
GB18030 was introduced later. It includes 27,484 Chinese characters and also covers Tibetan, Mongolian, Uyghur and other major minority scripts. The Windows platform is now required to support GB18030.
In the order GB18030, GBK, GB2312, the three encodings are backward compatible: each is a superset of the next, and a character present in all three has the same code in each.
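The backward compatibility described above can be checked with the codecs that ship with Python. This is a minimal sketch: the codec names `gb2312`, `gbk` and `gb18030` are Python's, the helper `can_encode` is a hypothetical name introduced here, and the sample characters are my own choice, not taken from the standards' documentation.

```python
# Encode the same text with the three GB codecs and compare the bytes.
text = "中文"  # both characters are already present in GB2312

encoded = {name: text.encode(name) for name in ("gb2312", "gbk", "gb18030")}
for name, data in encoded.items():
    print(name, data.hex())


def can_encode(ch: str, codec: str) -> bool:
    """Return True if the codec has a code for this character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# "丂" was added in GBK, so GB2312 cannot encode it,
# while GBK and GB18030 agree on its bytes.
extra = "丂"
print(can_encode(extra, "gb2312"), extra.encode("gbk").hex(), extra.encode("gb18030").hex())
```

The larger sets only add codes; they do not move the characters they inherit, which is why older GB2312 text can still be read with a GBK or GB18030 decoder.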
2. Taiwan, Hong Kong and other regions: Big5
3. Japan: Shift_JIS (SJIS)
If we think of all these text encodings as regional dialects, then Unicode is a common language developed jointly by the countries of the world.
In this common language there are no more encoding conflicts: text in any language can be displayed on the same screen. This is Unicode's greatest advantage.
So how does Unicode encode characters? The original idea is very simple:
give every character in the world a code that fits in 2 bytes. You might ask: 2 bytes can represent at most 65,536 codes; is that enough?
Most of the Chinese characters used in Korea and Japan came from China, and their shapes are essentially the same.
For example, the character "文" (wen) is the same character in GBK and SJIS; only its code differs.
Once such shared characters are unified under a single code, 2 bytes are enough to hold most of the world's writing systems.
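The "文" example can be reproduced in a few lines. This is only an illustrative sketch using Python's built-in `gbk` and `shift_jis` codecs; the variable names are my own.

```python
ch = "文"

# The same character has different byte codes in the two legacy encodings...
gbk_bytes = ch.encode("gbk")
sjis_bytes = ch.encode("shift_jis")
print("GBK: ", gbk_bytes.hex())
print("SJIS:", sjis_bytes.hex())

# ...but decoding either one leads to the same single Unicode code point.
code_from_gbk = ord(gbk_bytes.decode("gbk"))
code_from_sjis = ord(sjis_bytes.decode("shift_jis"))
print(f"U+{code_from_gbk:04X}", f"U+{code_from_sjis:04X}")
```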
The formal name of the character set, as standardized in ISO/IEC 10646, is "Universal Multiple-Octet Coded Character Set", abbreviated UCS.
What we commonly use today is UCS-2, a 2-byte encoding; UCS-4 was developed in case 2 bytes should prove insufficient in the future. The range covered by UCS-2 is also known as the Basic Multilingual Plane (BMP).
Converting a UCS-2 code to UCS-4 simply means prepending two zero bytes.
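As a small sketch of that conversion: Python has no codec literally named UCS-2 or UCS-4, so the example below assumes big-endian byte order and uses `utf-16-be` and `utf-32-be` as stand-ins, which produce the same bytes for characters in the Basic Multilingual Plane. The function name `ucs2_to_ucs4` is hypothetical.

```python
def ucs2_to_ucs4(two_bytes: bytes) -> bytes:
    """Widen one big-endian UCS-2 code to UCS-4 by prepending two zero bytes."""
    if len(two_bytes) != 2:
        raise ValueError("a UCS-2 code is exactly 2 bytes")
    return b"\x00\x00" + two_bytes


ucs2 = "文".encode("utf-16-be")  # 2 bytes for a BMP character
ucs4 = ucs2_to_ucs4(ucs2)        # 4 bytes, same value
print(ucs2.hex(), "->", ucs4.hex())
```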
UCS-4 is mainly needed for the supplementary planes, such as the second supplementary plane in Unicode 4.0, which spans these ranges:
20000-20FFF, 21000-21FFF, 22000-22FFF, 23000-23FFF, 24000-24FFF, 25000-25FFF, 26000-26FFF, 27000-27FFF, 28000-28FFF, 29000-29FFF, 2A000-2AFFF, 2F000-2FFFF
In total, 16 supplementary planes were added, extending the original 65,536 codes to a little over 1.1 million (17 × 65,536 = 1,114,112).
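The plane arithmetic is easy to verify. In this sketch the constants and the helper `plane_of` are names introduced for illustration; the only assumption is that each plane holds 0x10000 codes, so the plane number is simply the code point divided by 65,536.

```python
PLANE_SIZE = 0x10000   # 65,536 codes per plane
PLANE_COUNT = 17       # the basic plane plus 16 supplementary planes

total_code_points = PLANE_SIZE * PLANE_COUNT


def plane_of(code_point: int) -> int:
    """Return the plane number (0 = Basic Multilingual Plane) of a code point."""
    if not 0 <= code_point < total_code_points:
        raise ValueError("code point outside the 17 planes")
    return code_point >> 16  # same as integer division by 0x10000


print(total_code_points)
print(plane_of(ord("文")), plane_of(0x20000))
```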
Now that there is a unified encoding, how do we stay compatible with each country's existing text encodings?
This is where code pages come in.
What is a code page? A code page is the mapping table between one country's text encoding and Unicode.
For example, the mapping table between Simplified Chinese and Unicode is CP936, for which an official mapping table is published.
Below are a few commonly used code pages; the table for each one can be found by substituting its number in the address of the CP936 table.
codepage=936 Simplified Chinese GBK
codepage=950 Traditional Chinese BIG5
codepage=437 American/Canadian English
codepage=932 Japanese
codepage=949 Korean
codepage=866 Russian
codepage=65001 Unicode UTF-8
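Because every code page maps to Unicode, text can be moved from one legacy encoding to another by going through Unicode. The sketch below assumes Python's codec names `cp936` and `cp950` correspond to the code pages listed above; `transcode` is a hypothetical helper, not part of any library.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between two code pages, using Unicode as the pivot."""
    text = data.decode(source)   # legacy bytes -> Unicode, via the source table
    return text.encode(target)   # Unicode -> legacy bytes, via the target table


gbk_data = "中文".encode("cp936")                  # Simplified Chinese code page
big5_data = transcode(gbk_data, "cp936", "cp950")  # Traditional Chinese code page
print(gbk_data.hex(), "->", big5_data.hex())
```

The conversion only works for characters present in both tables; anything missing from the target code page raises `UnicodeEncodeError`.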
The last one, 65001, is as far as I understand only a virtual mapping table; in reality it is just an algorithm.
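To show what "just an algorithm" means, here is a rough sketch of UTF-8 encoding for a single code point, following the standard 1-to-4-byte bit layout. The function name `utf8_encode` is my own; surrogate code points are not handled, and in practice the built-in `str.encode("utf-8")` should be used instead.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (surrogates are not checked)."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:        # 7 bits  -> 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point above U+10FFFF")


print(utf8_encode(ord("文")).hex())
```

No lookup table is involved: the bytes are computed directly from the bits of the code point, which is why 65001 needs no mapping file.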
Take one line from the 936 table as an example:
0x9993 0x6ABD #CJK Unified Ideograph
The first value is the GBK code and the second is the corresponding Unicode code.
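A line in this format is straightforward to read programmatically. The following is a minimal sketch that assumes each entry consists of two hexadecimal fields followed by an optional `#` comment, as in the sample line; `parse_mapping_line` is a hypothetical helper.

```python
def parse_mapping_line(line: str) -> tuple[int, int, str]:
    """Split one mapping-table line into (legacy code, Unicode code, comment)."""
    data, _, comment = line.partition("#")
    fields = data.split()
    if len(fields) != 2:
        raise ValueError(f"expected two hex fields, got {fields!r}")
    legacy_code = int(fields[0], 16)   # int() accepts the 0x / 0X prefix with base 16
    unicode_code = int(fields[1], 16)
    return legacy_code, unicode_code, comment.strip()


legacy_code, code_point, comment = parse_mapping_line(
    "0x9993 0x6ABD #CJK Unified Ideograph"
)
print(f"GBK 0x{legacy_code:04X} -> U+{code_point:04X} ({comment})")
```

Reading every line of a table this way yields a dictionary from legacy codes to Unicode codes, which is all a code page really is.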