ASCII and Unicode, codePage, UTF-8

Source: Internet
Author: User

1. ASCII

ASCII (American Standard Code for Information Interchange) is a computer character encoding based on the Latin alphabet. It is mainly used to represent modern English and other Western European languages. It is the most common single-byte encoding system today and is equivalent to the international standard ISO/IEC 646.

One binary digit can express 2^1 = 2 states (0 and 1), two binary digits can express 2^2 = 4 states (00, 01, 10, 11), and so on, so a 7-bit binary number can represent 2^7 = 128 states. Each state is uniquely encoded as a 7-bit binary code that corresponds to one character (or control code), and the codes can be written as decimal numbers from 0 to 127. The 7-bit ASCII code therefore represents 128 characters.

Codes 0 to 31 and 127 (33 in total) are control characters or special communication characters, for example the control characters LF (line feed), CR (carriage return), FF (form feed), DEL (delete), BS (backspace), and BEL (bell), and the communication characters SOH (start of heading), EOT (end of transmission), and ACK (acknowledge). Code 32 is the space character. Codes 33 to 126 (94 in total) are printable characters: 48 to 57 are the ten Arabic numerals 0 to 9, 65 to 90 are the 26 uppercase English letters, 97 to 122 are the 26 lowercase English letters, and the rest are punctuation marks and operators.

Note: in a computer's storage unit, an ASCII code occupies one byte (eight binary digits), and its highest bit (b7) can be used as a parity bit. Parity checking is a method of detecting errors that occur during transmission. There are two kinds: odd parity and even parity. Odd parity rule: a correct byte must contain an odd number of 1 bits; if the seven code bits contain an even number of 1s, the highest bit b7 is set to 1. Even parity rule: a correct byte must contain an even number of 1 bits; if the seven code bits contain an odd number of 1s, b7 is set to 1. In other words, an ASCII character is stored in eight binary digits: seven encode the character, and the remaining one is used for error detection or left unused.
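The code ranges and the parity rules described above can be checked with a short Python sketch. The function names `add_parity` and `parity_ok` are made up for this illustration; they are not part of any standard library.

```python
def add_parity(code: int, odd: bool = False) -> int:
    """Place a parity bit in b7 of a 7-bit ASCII code and return the 8-bit value."""
    if not 0 <= code <= 0x7F:
        raise ValueError("ASCII codes are 7-bit values in the range 0..127")
    ones = bin(code).count("1")      # number of 1 bits among the 7 code bits
    count_is_odd = ones % 2 == 1
    # Set b7 only when the current count has the wrong parity for the chosen rule.
    if count_is_odd != odd:
        return code | 0x80
    return code


def parity_ok(byte: int, odd: bool = False) -> bool:
    """Check that an 8-bit value has the expected overall parity."""
    return bin(byte & 0xFF).count("1") % 2 == (1 if odd else 0)


# The ranges from the text: control codes and printable (non-space) characters.
control = [c for c in range(128) if c < 32 or c == 127]
printable = list(range(33, 127))
print(len(control), len(printable))          # 33 94

a = ord("A")                                 # 65 = 0b1000001, two 1 bits
print(f"{a:07b}")                            # 1000001
print(f"{add_parity(a):08b}")                # 01000001 (even parity, b7 stays 0)
print(f"{add_parity(a, odd=True):08b}")      # 11000001 (odd parity, b7 set to 1)

# Flipping any single bit in transit makes the check fail.
sent = add_parity(ord("C"))
print(parity_ok(sent), parity_ok(sent ^ 0x01))   # True False
```

A single parity bit only detects an odd number of flipped bits; it cannot say which bit changed.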

 

2. Unicode

Unicode (also called the unified code or universal code) is a character encoding standard used on computers. It assigns a unified and unique code to every character in every language to meet the requirements of cross-language, cross-platform text conversion and processing. Development began in the late 1980s, and version 1.0 of the standard was published in 1991. As computers have become more capable, Unicode has been widely adopted in the years since.
Early versions of Unicode used two bytes to represent every character, but the standard has since outgrown 16 bits. Unicode defines a character set large enough to represent all human-readable characters and symbols in the world, and maps them to numbers from 0 to 0x10FFFF, which gives 1,114,112 code points. A code point is a number that can be assigned to a character. UTF-8, UTF-16, and UTF-32 are all encoding schemes that convert these numbers into program data.
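The difference between a code point (a number) and its encoded bytes can be seen with a minimal Python sketch. The helper name `encoded_sizes` and the sample characters are chosen only for this illustration.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the code point of one character and its size in three encodings."""
    return {
        "code_point": ord(ch),
        # The -be variants are used so that no byte order mark is added.
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


for ch in ["A", "é", "中", "😀"]:
    info = encoded_sizes(ch)
    print(f"U+{info['code_point']:04X}", info["utf-8"], info["utf-16"], info["utf-32"])
# U+0041 1 2 4
# U+00E9 2 2 4
# U+4E2D 3 2 4
# U+1F600 4 4 4

# The code space runs from 0 to 0x10FFFF, which is 1,114,112 values.
print(0x10FFFF + 1)              # 1114112
print(len(chr(0x10FFFF)))        # 1 (the highest code point is still one character)
```

The same code point always identifies the same character; only the number of bytes changes with the encoding scheme.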

 

So how can Unicode stay compatible with the text encodings of different countries, such as GB2312 and GBK in China, or the Japanese and Korean encodings?
This is where code pages come in.
 

3. codePage

What is a code page? A code page is the mapping between a country's text encoding and Unicode.
For example, the mapping table between Simplified Chinese and Unicode is cp936; an official mapping table is published for it.

The following are several frequently used code pages. Each has its own mapping table, identified by its number.
CodePage = 936 Simplified Chinese GBK
CodePage = 950 Traditional Chinese Big5
CodePage = 437 US/Canada English
CodePage = 932 Japanese
CodePage = 949 Korean
CodePage = 866 Russian

Take an arbitrary row from the 936 table, for example:
0x9993 0x6ABD # CJK UNIFIED IDEOGRAPH
The first value is the GBK encoding and the second is the Unicode code point.
By looking up this table, you can easily convert between GBK and Unicode.

Now that we understand Unicode, what is UTF-8, and why does it exist?

 

4. UTF-8

 

It turns out that storing characters that ASCII can express in a fixed two-byte Unicode form is inefficient: each character takes twice the space it does in ASCII, and the high byte is always 0 and carries no information. To solve this problem, several intermediate formats for the character set emerged, called Unicode Transformation Formats (UTF). The existing UTF formats include UTF-7, UTF-7.5, UTF-8, UTF-16, and UTF-32.
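The wasted space is easy to see in a small Python sketch. It uses `utf-16-be` as a stand-in for a fixed two-byte form, which is an assumption that holds for the ASCII sample text used here.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
two_byte = text.encode("utf-16-be")   # big-endian, no byte order mark

print(ascii_bytes.hex(" "))   # 48 65 6c 6c 6f
print(two_byte.hex(" "))      # 00 48 00 65 00 6c 00 6c 00 6f

# Every other byte is zero, so half of the two-byte form carries no information.
high_bytes = two_byte[0::2]
print(len(two_byte) == 2 * len(ascii_bytes), set(high_bytes) == {0})   # True True
```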

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and is also a prefix code. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software written to process ASCII text can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for storing or transmitting text in e-mail, web pages, and other applications.
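The ASCII compatibility can be illustrated with a short Python sketch. The sample strings and the `=` delimiter are arbitrary choices for this example, not taken from any particular protocol.

```python
# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
ascii_text = "GET /index.html"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))   # True

# Every byte of a non-ASCII character is 0x80 or above, so it can never be
# mistaken for an ASCII character such as '=' or '/'.
mixed = "naïve 中文"
non_ascii_bytes = [b for ch in mixed if ord(ch) > 0x7F for b in ch.encode("utf-8")]
print(all(b >= 0x80 for b in non_ascii_bytes))                    # True

# Byte-oriented code that only knows about an ASCII delimiter still works.
raw = "名前=値".encode("utf-8")
key, value = raw.split(b"=")
print(key.decode("utf-8"), value.decode("utf-8"))                 # 名前 値
```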

UTF-8 uses one to four bytes to encode each character:
1. The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).
2. Latin letters with diacritics, and the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana alphabets, need two bytes (Unicode range U+0080 to U+07FF).
3. Other characters in the Basic Multilingual Plane (BMP), which contains most characters in common use, need three bytes.
4. Characters in the rarely used supplementary planes of Unicode need four bytes.
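The four cases above correspond to four bit layouts. The following is a minimal hand-written encoder in Python, meant only to make the layouts visible; real programs should use the built-in `str.encode("utf-8")`, and the function name `utf8_encode` is made up for this sketch.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare with Python's own encoder, one sample per category.
for ch in ["A", "é", "中", "😀"]:
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", encoded.hex(" "), encoded == ch.encode("utf-8"))
# U+0041 41 True
# U+00E9 c3 a9 True
# U+4E2D e4 b8 ad True
# U+1F600 f0 9f 98 80 True
```

The leading byte announces the total length, and every continuation byte starts with the bits 10, so a decoder can always find the start of a character.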
For the fourth category above, using four bytes per character may seem wasteful. However, UTF-8 encodes all commonly used characters in three bytes or fewer, and its alternative, UTF-16, also needs four bytes for characters in that category, so whether UTF-8 or UTF-16 is more efficient depends on the distribution of the characters used. If a traditional compression system such as deflate is applied, the differences between the encodings become insignificant. Because traditional compression algorithms are not very effective on short texts, the Standard Compression Scheme for Unicode (SCSU) can be considered for those.
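A rough way to see this trade-off is to measure it. The sketch below uses Python's `zlib` module (which implements deflate) on two artificial, highly repetitive sample texts, so the compression ratios are far better than real documents would give; the helper name `sizes` is hypothetical.

```python
import zlib


def sizes(text: str) -> dict:
    """Return (raw, deflate-compressed) byte counts for UTF-8 and UTF-16."""
    result = {}
    for encoding in ("utf-8", "utf-16-le"):
        raw = text.encode(encoding)
        result[encoding] = (len(raw), len(zlib.compress(raw)))
    return result


english = "The quick brown fox jumps over the lazy dog. " * 200
chinese = "统一码为每个字符设定了唯一的编码。" * 200

for name, text in (("english", english), ("chinese", chinese)):
    print(name, sizes(text))
# Uncompressed, English is half the size in UTF-8, while this Chinese sample
# is one and a half times larger in UTF-8 (3 bytes per character against 2).
# After deflate, both encodings shrink to a small fraction of their raw size.

# Very short inputs do not benefit: the compressed form is longer than the input.
short = "Hi".encode("utf-8")
print(len(short), len(zlib.compress(short)))
```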

The Internet Engineering Task Force (IETF) requires that all Internet protocols support UTF-8 encoding. [1] The Internet Mail Consortium (IMC) recommends that all email software support UTF-8 encoding. Of all major email software, only Eudora does not support UTF-8 encoding.

 

 

5. GBK and GB2312

China established the GB 2312 standard, but it includes only 6,763 Chinese characters, and many characters are missing. As a result, Microsoft developed the GBK code table cp936, which was first implemented in the Simplified Chinese version of Windows 95.

That is to say, GBK itself is not a national standard; it was published by the Standardization Department of the State Bureau of Technical Supervision and the Science and Technology and Quality Supervision Department of the Ministry of Electronics Industry as a "guiding technical specification document". The later national standard GB 18030 is technically compatible with GBK, so GBK has effectively been accepted.
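The relationship between the three encodings can be probed with Python's built-in `gb2312`, `gbk`, and `gb18030` codecs. This is a small sketch; the helper name `can_encode` and the two sample characters are chosen only for illustration.

```python
def can_encode(ch: str, codec: str) -> bool:
    """Report whether a character exists in the given encoding."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


codecs_to_try = ("gb2312", "gbk", "gb18030")

# '中' is a common simplified character; '龍' is a traditional character
# that GB 2312 does not include.
for ch in ["中", "龍"]:
    print(ch, {codec: can_encode(ch, codec) for codec in codecs_to_try})
# 中 {'gb2312': True, 'gbk': True, 'gb18030': True}
# 龍 {'gb2312': False, 'gbk': True, 'gb18030': True}

# For these characters the GBK bytes are unchanged in GB 18030.
print(all(ch.encode("gbk") == ch.encode("gb18030") for ch in "中龍"))   # True
```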

 

References:

ASCII
http://baike.baidu.com/view/15482.htm

Unicode
http://baike.baidu.com/view/40801.htm
http://zh.wikipedia.org/wiki/Unicode
