Character encoding: ASCII, Unicode and UTF-8

Source: Internet
Author: User


http://blog.csdn.net/pipisorry/article/details/42387045

ASCII code

ASCII defines a total of 128 characters. For example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 control characters that cannot be printed) occupy only the low seven bits of a byte; the highest bit is always set to 0.
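As a quick check of the values above, the following Python sketch prints the code and the 8-bit binary form of a few ASCII characters. The helper name `ascii_bits` is made up for this illustration.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit binary form of a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    # Codes 0-127 fit in seven bits, so the highest of the eight bits is 0.
    return format(code, "08b")


for ch in (" ", "A"):
    print(repr(ch), ord(ch), ascii_bits(ch))
```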


Non-ASCII Encoding

128 symbols are enough to encode English, but not other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. Some European countries therefore decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encoding systems used in these European countries can represent up to 256 symbols.

However, different countries have different letters. Even though they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings the codes 0-127 represent the same symbols; only the range 128-255 differs.
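The ambiguity of byte 130 can be reproduced with Python's built-in codecs. The text does not name the specific encodings, so the three DOS code pages below (cp437, cp862, cp866) are an assumption chosen for illustration; other legacy encodings may map 130 to something else again.

```python
raw = bytes([130])  # the single byte 0x82

# Assumed code pages, picked only to illustrate the point made in the text.
code_pages = {
    "cp437": "Western European (DOS)",
    "cp862": "Hebrew (DOS)",
    "cp866": "Russian (DOS)",
}

decoded = {name: raw.decode(name) for name in code_pages}
for name, label in code_pages.items():
    print(f"{label:24} byte 130 -> {decoded[name]}")

# The lower half (0-127) decodes to the same text under all three.
low_half = bytes(range(128))
same_low_half = len({low_half.decode(name) for name in code_pages}) == 1
print("0-127 identical:", same_low_half)
```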

Asian languages use far more characters; Chinese alone has roughly 100,000. A single byte can represent only 256 symbols, so several bytes must be used for one symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes for each Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.

Note that although these encodings also use multiple bytes per symbol, the GB family of Chinese character encodings has nothing to do with the Unicode and UTF-8 discussed later in this article.


Unicode

As mentioned above, there are many encodings in the world, and the same binary number can be interpreted as different symbols. To open a text file you therefore have to know its encoding; read it with the wrong one and you get garbled characters. Why do emails so often arrive garbled? Because the sender and the receiver use different encodings.

Imagine a single encoding that includes every symbol in the world and gives each one a unique code. The garbling problem would disappear. That is Unicode: as its name suggests, it is an encoding for all symbols.

Unicode is, of course, a very large collection. Its code space currently has room for more than one million symbols, and each symbol has its own code. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 (yán, "strict"). You can look up specific symbols in the tables at unicode.org or in a dedicated Chinese character table.
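The three code points mentioned above can be inspected with Python's standard `unicodedata` module. This is only a small lookup sketch; the helper name `code_point_label` is made up here.

```python
import sys
import unicodedata


def code_point_label(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ("\u0639", "A", "\u4e25"):
    print(code_point_label(ch), unicodedata.name(ch))

# Highest code point Python can represent; the code space holds
# sys.maxunicode + 1 code points in total.
print(hex(sys.maxunicode))
```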


Unicode Problems

Note that Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that code should be stored.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which is 15 bits long in binary (100111000100101). Representing this symbol therefore takes at least two bytes. Symbols with larger codes may need three or four bytes, or even more.

This raises two serious problems. First, how can Unicode be told apart from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, each English letter would have to be preceded by two or three zero bytes. That is a huge waste of storage: a text file would grow to three or four times its necessary size, which is unacceptable.
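The storage cost of a fixed-width scheme can be seen by comparing ASCII with UTF-32, a real fixed four-byte encoding used here as a stand-in for the hypothetical "three or four bytes per symbol" scheme described above.

```python
text = "Hello"

as_ascii = text.encode("ascii")
as_fixed = text.encode("utf-32-be")  # four bytes per character, no byte order mark

print(len(as_ascii), as_ascii)
print(len(as_fixed), as_fixed)

# Each English letter is preceded by three zero bytes.
first_letter = as_fixed[:4]
print(first_letter)
```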

The results were: 1) several ways of storing Unicode appeared, that is, there are many different binary formats that can represent Unicode; 2) for a long time Unicode could not be widely adopted, until the Internet emerged.


UTF-8

As the Internet became popular, a unified encoding was badly needed. UTF-8 is the most widely used implementation of Unicode on the Internet.

Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but they are rarely used on the Internet. To repeat the point: UTF-8 is one of the implementations of Unicode.

The most important feature of UTF-8 is that it is a variable-length encoding. It uses one to four bytes per symbol, and the number of bytes depends on the symbol.

The encoding rules of UTF-8 are very simple. There are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining seven bits hold the Unicode code of the symbol. For English letters, therefore, UTF-8 and ASCII are identical.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All the remaining bits hold the Unicode code of the symbol.


The following table summarizes the encoding rules. The letter x indicates the available encoding bits.

Unicode symbol range  | UTF-8 encoding
(hexadecimal)         | (binary)
----------------------+-------------------------------------
0000 0000-0000 007F   | 0xxxxxxx
0000 0080-0000 07FF   | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF   | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF   | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
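The table can be turned into a small encoder. The function below is a teaching sketch that follows the four rows of the table directly; `utf8_encode` is a hypothetical name, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the table row by row."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable")

    if code_point <= 0x7F:            # row 1: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:           # row 2: 110xxxxx + 1 continuation byte
        length, prefix = 2, 0b11000000
    elif code_point <= 0xFFFF:        # row 3: 1110xxxx + 2 continuation bytes
        length, prefix = 3, 0b11100000
    else:                             # row 4: 11110xxx + 3 continuation bytes
        length, prefix = 4, 0b11110000

    tail = []
    for _ in range(length - 1):
        # Take the lowest six bits and mark them as a continuation byte (10xxxxxx).
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # Whatever bits are left go into the leading byte after its prefix.
    return bytes([prefix | code_point] + tail[::-1])


print(utf8_encode(0x41), utf8_encode(0x4E25).hex())
```

Comparing the result against `chr(cp).encode("utf-8")` for values on both sides of each range boundary is an easy way to check the sketch.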

With this table, decoding UTF-8 is straightforward. If the first bit of a byte is 0, that byte is a character on its own. If the first bit is 1, the number of consecutive 1s tells you how many bytes the current character occupies.
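The decoding rule just described can be sketched as follows: look at the leading bits of the first byte to find the length, then cut the byte stream into characters. Both function names are made up for this example, and the sketch does no validation of the continuation bytes.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Number of bytes in the character that starts with first_byte."""
    if (first_byte & 0b10000000) == 0b00000000:
        return 1
    if (first_byte & 0b11100000) == 0b11000000:
        return 2
    if (first_byte & 0b11110000) == 0b11100000:
        return 3
    if (first_byte & 0b11111000) == 0b11110000:
        return 4
    raise ValueError(f"{first_byte:#04x} is not a valid leading byte")


def split_characters(data: bytes) -> list:
    """Cut a UTF-8 byte string into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_characters("A\u4e25\u00e9".encode("utf-8")))
```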

Next, take the Chinese character 严 as an example of how UTF-8 encoding works.

The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of 严, fill the x positions from back to front and pad any leftover positions with 0. The UTF-8 encoding of 严 is therefore "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
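The same fill-in procedure can be traced with bit strings. This is a sketch for the three-byte case only, written to mirror the steps of the worked example rather than to be a general encoder.

```python
code_point = 0x4E25

bits = format(code_point, "b")   # binary digits of the code, 15 of them
padded = bits.zfill(16)          # the 3-byte template has 4 + 6 + 6 = 16 x slots

# Split into the three groups that fill 1110xxxx 10xxxxxx 10xxxxxx.
groups = [padded[:4], padded[4:10], padded[10:]]
encoded_bits = ["1110" + groups[0], "10" + groups[1], "10" + groups[2]]

encoded = bytes(int(b, 2) for b in encoded_bits)
print(" ".join(encoded_bits), encoded.hex().upper())
```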


Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done by a program.
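In Python, for instance, the conversion in both directions needs only the built-in `chr`, `ord`, `str.encode` and `bytes.decode`. The two wrapper names below are hypothetical and exist only to make the direction of each conversion explicit.

```python
def code_point_to_utf8(code_point: int) -> bytes:
    """Unicode code -> UTF-8 bytes."""
    return chr(code_point).encode("utf-8")


def utf8_to_code_point(data: bytes) -> int:
    """UTF-8 bytes of a single character -> Unicode code."""
    text = data.decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected the bytes of exactly one character")
    return ord(text)


print(code_point_to_utf8(0x4E25).hex().upper())
print(f"{utf8_to_code_point(bytes.fromhex('E4B8A5')):04X}")
```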

On Windows, one of the simplest ways to convert is the built-in Notepad program, notepad.exe. After opening a file, choose "Save As" in the "File" menu. A dialog box appears with an "Encoding" drop-down list at the bottom.

There are four options: ANSI, Unicode, Unicode big endian and UTF-8.

1) ANSI is the default. For English files it means ASCII; for simplified Chinese files it means GB2312 (only on Simplified Chinese editions of Windows; Traditional Chinese editions use Big5).

2) Unicode refers to the UCS-2 encoding, which stores the Unicode code of a character directly in two bytes. This option uses the little endian format, in which the low-order byte is stored first.

3) Unicode big endian is the counterpart of the previous option: the same two bytes are stored with the high-order byte first.

4) UTF-8 is the encoding described in the previous section.

After choosing an encoding, click "Save" and the file is converted immediately.
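The four options can be approximated in Python to see which bytes each one produces for 严. The codec names are an assumed mapping: "ANSI" is taken to be GB2312 as on a Simplified Chinese system, and UTF-16 stands in for UCS-2, which gives the same bytes for this character. Notepad may also write a byte order mark at the start of the file, which this sketch leaves out.

```python
ch = "\u4e25"  # the character used as the running example

# Assumed mapping from Notepad's menu labels to Python codec names.
notepad_options = {
    "ANSI": "gb2312",
    "Unicode": "utf-16-le",
    "Unicode big endian": "utf-16-be",
    "UTF-8": "utf-8",
}

encoded = {label: ch.encode(codec) for label, codec in notepad_options.items()}
for label, data in encoded.items():
    print(f"{label:20} {data.hex(' ').upper()}")
```

The two "Unicode" rows contain the same two bytes, 4E and 25, in opposite order, which is exactly the little endian versus big endian difference.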


GB2312

In 1980 China issued GB2312-80, which contains 7445 characters in total: 6763 Chinese characters and 682 other symbols.

GB2312-80 is usually referred to simply as GB2312.

On Windows, its code page is CP936.


GBK

Microsoft extended GB2312-80 by using the code space that GB 2312-80 left unused, adding all the Chinese characters of GB 13000.1-93 and Unicode 1.1. The result is the GBK encoding.

GBK contains 21886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21003 characters.

As an extension of GB2312, GBK still uses code page CP936 on current Windows systems. The difference is that the original code page 936 supported only GB2312, whereas the current code page 936 supports GBK, which is itself backward compatible with GB2312.

Therefore, in terms of code values, GBK is compatible with the older GB2312. It is not compatible with GB13000, whose encoding method is different, although it covers the same Chinese characters as GB13000.


In terms of code compatibility, the evolutionary sequence is: ASCII -> GB2312 -> GBK -> GB18030
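The backward compatibility along this chain can be spot-checked with Python's built-in `gb2312`, `gbk` and `gb18030` codecs. The sample characters are chosen for this sketch: 严 (simplified, present in GB2312) and its traditional form 嚴, which is used here as an example of a character that GB2312 lacks but GBK includes.

```python
chain = ["gb2312", "gbk", "gb18030"]

ascii_text = "ASCII text"
simplified = "\u4e25"   # 严
traditional = "\u56b4"  # 嚴


def encodable(text: str, codec: str) -> bool:
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for codec in chain:
    print(
        codec,
        ascii_text.encode(codec) == ascii_text.encode("ascii"),
        simplified.encode(codec).hex().upper(),
        encodable(traditional, codec),
    )
```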


Appendix:

Chinese character encoding standard

Encoding standard | Alias                        | Issued by      | Characters covered
------------------+------------------------------+----------------+------------------------------------------------
ASCII             |                              | International  |
GB2312            | CP936 in Microsoft Windows   | Mainland China | 6763 Chinese characters and 682 other symbols
Unicode 1.1       |                              | International  | 20,902 characters
GB13000           |                              | Mainland China | 20,902 characters
GBK               | CP936 in Microsoft Windows   | Microsoft      | 21886 symbols
GB18030           | CP54936 in Microsoft Windows | Mainland China | 27484 Chinese characters + other minority characters

Character storage (exchange) standards

Characters covered       | Character encoding standard              | Storage (exchange/transmission) standard
-------------------------+------------------------------------------+-----------------------------------------
English                  | ASCII (= ISO/IEC 646)                    | ASCII
Many European characters | ISO 8859                                 |
General (any) characters | Unicode                                  | UTF-8, UTF-16, UTF-32
Simplified Chinese       | GB2312 (= GB2312-80 = GB = GB0)          | EUC-CN
                         | GBK                                      |
                         | GB18030                                  |
Traditional Chinese      | BIG5 (= Big Five)                        |
                         | CCCII                                    |
                         | CNS-11643                                | EUC-TW
Japanese                 | JIS X 0208 (= JIS C 6226) and JIS X 0212 | Shift JIS, ISO-2022-JP, EUC-JP
                         | JIS X 0213                               | EUC-JISX0213
Korean                   | KS X 1001 (= KS C 5601)                  | EUC-KR




