Code Historian

Last Update:2017-01-18 Source: Internet

Author: User


The story of character encoding
What is a character
A character is a meaningful graphic symbol, such as "a" or "中". The same symbol can carry different meanings in different countries.

But the computer world has only 0s and 1s. So how do you represent these characters with 0 and 1? That is why encodings exist.

An encoding is nothing sophisticated: it is simply a mapping between the computer's 0s and 1s and characters such as A and B.

So the story begins ...
Once upon a time, there were only Americans in the computer world. American writing is very simple: it has only 26 letters, and even counting uppercase and lowercase, the Arabic numerals, punctuation, and the computer control characters (carriage return and the like), the total is well under 256 (only 128, numbered 0 to 127). So for them it was natural to store each character in 8 bits. They called 8 bits a byte, gave each English character a number that fits in a byte, and drew up a table (the ASCII code table). That is how ASCII, the first encoding in this story, appeared.
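
To make the "table" idea concrete, here is a minimal Python sketch. It relies only on the built-in `ord`, `chr`, and the standard `ascii` codec; the variable names are illustrative and not taken from the article.

```python
# ASCII is a plain lookup table between small integers and characters.
letter_code = ord("A")          # character -> number
letter_back = chr(letter_code)  # number -> character

# Letters and control characters (carriage return, line feed) alike
# map to numbers below 128, so each one fits in a single byte.
text = "Hello\r\n"
codes = list(text.encode("ascii"))
all_below_128 = all(code < 128 for code in codes)

print(letter_code, letter_back)  # 65 A
print(codes)                     # [72, 101, 108, 108, 111, 13, 10]
print(all_below_128)             # True
```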

Then the Europeans arrived. Europe has many countries, each with its own writing, such as accented Latin letters, Greek and so on. What to do? They reasoned: the American ASCII table only defines codes 0 to 127, so codes 128 to 255 are still unassigned. Fine, we will help ourselves. So the Europeans packed their various alphabets into the codes above 127, forming the ISO 8859 series of character sets. For example, adding Greek to ASCII produced ISO/IEC 8859-7, and adding the Western European languages produced ISO/IEC 8859-1, also known as Latin-1. (Yes, the latin1 you often see in MySQL.)

The ISO 8859 series has 15 parts. They are listed below, together with the logical-order variant of the Hebrew part:

ISO/IEC 8859-1 (Latin-1) - Western European languages
ISO/IEC 8859-2 (Latin-2) - Central European languages
ISO/IEC 8859-3 (Latin-3) - Southern European languages. Esperanto can also be written with this character set.
ISO/IEC 8859-4 (Latin-4) - Nordic languages
ISO/IEC 8859-5 (Cyrillic) - Slavic languages
ISO/IEC 8859-6 (Arabic) - Arabic
ISO/IEC 8859-7 (Greek) - Greek
ISO/IEC 8859-8 (Hebrew) - Hebrew (visual order)
ISO/IEC 8859-8-I - Hebrew (logical order)
ISO/IEC 8859-9 (Latin-5 or Turkish) - Latin-1 with its Icelandic letters replaced by Turkish letters
ISO/IEC 8859-10 (Latin-6 or Nordic) - Nordic languages, intended to replace Latin-4
ISO/IEC 8859-11 (Thai) - Thai, derived from Thailand's TIS-620 standard character set
ISO/IEC 8859-13 (Latin-7 or Baltic Rim) - Baltic languages
ISO/IEC 8859-14 (Latin-8 or Celtic) - Celtic languages
ISO/IEC 8859-15 (Latin-9) - Western European languages. Adds the Finnish letters and uppercase French accented letters missing from Latin-1, plus the euro sign (€).
ISO/IEC 8859-16 (Latin-10) - South-Eastern European languages. Mainly used for Romanian, and includes the euro sign.
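
The practical consequence of reusing codes 128 to 255 is that one byte means different things under different ISO 8859 parts. The sketch below illustrates this with Python's standard `iso8859_1`, `iso8859_7`, and `iso8859_5` codecs; the byte 0xE4 is an arbitrary choice for illustration.

```python
# One byte above 127 means different things in different ISO 8859 parts.
raw = bytes([0xE4])

as_latin1 = raw.decode("iso8859_1")    # Western European table
as_greek = raw.decode("iso8859_7")     # Greek table
as_cyrillic = raw.decode("iso8859_5")  # Cyrillic table

# Bytes below 128 remain plain ASCII in every part.
ascii_same = b"abc".decode("iso8859_1") == b"abc".decode("iso8859_7") == "abc"

print(as_latin1, as_greek, as_cyrillic)  # ä δ ф
print(ascii_same)                        # True
```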

Then the great Chinese people began to use computers. Chinese writing is vast and profound, with far more than 256 characters, so extending ASCII within a single byte was not an option. What to do? In 1981 the state assigned a group of people to the task. They collected about 6,000 Chinese characters (it later turned out that their coverage was limited and many characters had been missed, which is why several Chinese encodings exist) and represented each one with two bytes (16 bits). 16 bits can represent 65,536 values, which is plenty. The 16 bits are split into a leading 8 bits and a trailing 8 bits:
If the leading byte is 127 or less (the ASCII range), that byte on its own stands for an English (ASCII) character.
If the leading byte is greater than 127, it and the following byte together represent one Chinese character.
This encoding is GB2312. What does GB mean? Guobiao (国标), "national standard".

Later, some leaders found that their names could not be encoded, and the problem surfaced: 6,000-odd characters do not cover all Chinese text. In 1995 the state organized another group to keep collecting rare characters, reaching a total of 21,886 Chinese characters and symbols. The result was the GBK encoding, which is backward compatible with GB2312.

What does the K mean? Kuozhan (扩展), "extension".

Then it turned out that minority scripts such as Manchu and Mongolian had not been included in GBK, so the collecting continued, producing the GB18030 encoding.

People in Taiwan, of course, did not use the GB series compiled on the mainland, so they have their own Chinese encoding, BIG5, which includes 13,060 Chinese characters and symbols. Note, however, that the BIG5 mapping table is completely different from the GB series: the same character "中" is two different byte pairs in BIG5 and in GB2312. Text moved between the two comes out wrong (for example, "陶喆" (Tao Zhe) turning into "陶吉吉"), which is why so many Simplified/Traditional Chinese transcoding tools appeared.
What does BIG5 mean?
It refers to five major Chinese software packages: word processing, database, spreadsheet, communications, and graphics. The encoding was mainly intended for these five areas.
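
The mismatch between the BIG5 and GB2312 tables described above can be observed directly. The following sketch uses Python's standard `gb2312` and `big5` codecs; `errors="replace"` is used so the wrong-table decode never raises, whatever it produces.

```python
char = "中"
gb_bytes = char.encode("gb2312")
big5_bytes = char.encode("big5")

# Decoding GB2312 bytes with the BIG5 table gives the wrong text.
wrong = gb_bytes.decode("big5", errors="replace")
right = gb_bytes.decode("gb2312")

print(gb_bytes.hex(), big5_bytes.hex())  # d6d0 a4a4
print(wrong == char, right == char)      # False True
```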

Having every country use its own encoding is cumbersome, so everyone hoped for a single unified one. Thus Unicode appeared. The character set behind Unicode is called UCS (Universal Character Set). It is one large code space in which each script is allotted its own region. The form in common use has been UCS-2, meaning that, English or Chinese alike, every character is assigned a two-byte (16-bit) code. UCS-2 can represent 2^16 (65,536) characters, which basically covers the world's everyday scripts. What if that is not enough? There is also UCS-4, which represents one character in 4 bytes.

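Remember: the UTF-xx encodings are the concrete ways of implementing Unicode.
UTF-16 is the most basic one. A character in the 16-bit range is stored directly as its 16-bit code, so for those characters UTF-16 is a direct mapping of the character set. Characters beyond that range are stored as a pair of 16-bit units (a surrogate pair).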
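
A short Python sketch of this behaviour, using the standard `utf-16-be` codec (big-endian, without a byte order mark). The emoji U+1F600 is an arbitrary example of a character outside the 16-bit range, not one taken from the article.

```python
han = "汉"  # U+6C49, inside the 16-bit range
han_utf16 = han.encode("utf-16-be")

emoji = "\U0001F600"  # U+1F600, outside the 16-bit range
emoji_utf16 = emoji.encode("utf-16-be")

print(hex(ord(han)), han_utf16.hex())      # 0x6c49 6c49
print(hex(ord(emoji)), emoji_utf16.hex())  # 0x1f600 d83dde00
print(len(han_utf16), len(emoji_utf16))    # 2 4
```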

That would have been the end of it, but the Americans objected. Why should every English character take 2 bytes? Why waste our bandwidth and CPU? So a group of English speakers worked out the UTF-8 encoding.
How does UTF-8 work?
English characters take one byte, exactly as in ASCII.
Other characters are assigned a template according to the range their code falls in, not according to their language. The templates are 16, 24, or even 32 bits long. The bits of each character's code are filled into the matching template to produce its UTF-8 bytes.

Take the Chinese character "汉" as an example.
The template for Chinese characters is 1110xxxx 10yyyyyy 10zzzzzz.
The Unicode code point of "汉" is 0x6C49, which in binary is 0110 1100 0100 1001.
Fill the binary digits into the x, y, z slots of the template in order.
The result is 11100110 10110001 10001001, that is, E6 B1 89.
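
The same steps can be written out as bit operations. The function below is a minimal sketch of the 3-byte template only; `utf8_three_byte` is a hypothetical name, and the result is compared with Python's built-in `utf-8` codec as a check.

```python
def utf8_three_byte(code_point):
    """Fill a code point in U+0800..U+FFFF into 1110xxxx 10yyyyyy 10zzzzzz."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the 3-byte template covers U+0800..U+FFFF only")
    # Note: this sketch does not reject the surrogate range U+D800..U+DFFF.
    first = 0b11100000 | (code_point >> 12)           # 1110xxxx: top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10yyyyyy: middle 6 bits
    third = 0b10000000 | (code_point & 0x3F)          # 10zzzzzz: low 6 bits
    return bytes([first, second, third])

manual = utf8_three_byte(0x6C49)
builtin = "汉".encode("utf-8")

print(manual.hex())       # e6b189
print(manual == builtin)  # True
```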

All right... do you see the problem for Chinese? A Chinese character needs only two bytes in UTF-16 (or GBK) but three bytes in UTF-8. If a page has 10,000 characters, that is 10,000 extra bytes to transmit. Bandwidth! Now you understand why some websites in China, such as Sina, chose GBK as their encoding.
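
The arithmetic can be checked with a quick sketch. The text here is a stand-in (one character repeated 10,000 times), and `utf-16-le` is used so that the 2-byte byte order mark does not distort the count.

```python
text = "汉" * 10000  # stand-in for a page with 10,000 Chinese characters

sizes = {
    "utf-8": len(text.encode("utf-8")),
    "utf-16": len(text.encode("utf-16-le")),  # -le: no byte order mark
    "gbk": len(text.encode("gbk")),
}
extra = sizes["utf-8"] - sizes["gbk"]

print(sizes)  # {'utf-8': 30000, 'utf-16': 20000, 'gbk': 20000}
print(extra)  # 10000
```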

Now a few words about how many editors guess the encoding automatically. The editor checks whether the bytes in the file are UTF-8 or GBK, essentially by testing them against the UTF-8 templates: if they fit, the file is treated as UTF-8. This is why, as many articles point out, typing "联通" (China Unicom) into a txt file, saving it as GBK, and reopening it gives garbled text: the GBK bytes of those two characters happen to fit the UTF-8 templates.
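
The effect can be reproduced with a deliberately naive template check. This is a sketch of the idea only and not a claim about how any particular editor is implemented; `looks_like_utf8_templates` is a hypothetical helper that inspects just the leading bit patterns. A strict UTF-8 decoder would in fact reject these bytes, because 0xC1 is never a valid lead byte.

```python
def looks_like_utf8_templates(data):
    """Naive check: do the bytes follow UTF-8 lead/continuation bit patterns?"""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            follow = 0            # 0xxxxxxx: single byte
        elif b >> 5 == 0b110:
            follow = 1            # 110xxxxx: one continuation byte
        elif b >> 4 == 0b1110:
            follow = 2            # 1110xxxx: two continuation bytes
        elif b >> 3 == 0b11110:
            follow = 3            # 11110xxx: three continuation bytes
        else:
            return False
        tail = data[i + 1:i + 1 + follow]
        # Every continuation byte must look like 10xxxxxx.
        if len(tail) != follow or any(t >> 6 != 0b10 for t in tail):
            return False
        i += 1 + follow
    return True

gbk_bytes = "联通".encode("gbk")
fooled = looks_like_utf8_templates(gbk_bytes)
other = looks_like_utf8_templates("中文".encode("gbk"))

print(gbk_bytes.hex())  # c1aacda8
print(fooled, other)    # True False
```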

And what is ANSI? Windows uses UTF-16 internally, but text is displayed and saved according to the language configured in the system. ANSI is the name Windows gives to an encoding that changes automatically with the locale you have set. For example, on a Chinese Windows system ANSI means GBK, and on a Japanese system it means a JIS encoding (Shift JIS).
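
As a rough illustration, Python ships codecs for the Windows code pages behind these two ANSI settings: `cp936` (Simplified Chinese, GBK) and `cp932` (Japanese, Shift JIS based). The sketch below only compares the codecs; it does not query a real Windows locale, and the sample text is arbitrary.

```python
text = "日本"
ansi_chinese = text.encode("cp936")   # "ANSI" on a Simplified Chinese Windows
ansi_japanese = text.encode("cp932")  # "ANSI" on a Japanese Windows

# A file saved under one ANSI code page is misread under the other.
misread = ansi_japanese.decode("cp936", errors="replace")
reread = ansi_japanese.decode("cp932")

print(ansi_chinese != ansi_japanese)    # True
print(misread == text, reread == text)  # False True
```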


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
