ASCII, gb2312, Unicode, UTF-8

Source: Internet
Author: User
Tags: control characters

ASCII is a character set that covers the uppercase and lowercase English letters, digits, punctuation, and control characters. Each character has a code from 0 to 127, so it fits in a single byte.
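
The 0 to 127 range is easy to check interactively. The following is a minimal sketch that assumes Python 3; the helper name `is_ascii` is made up for illustration and is not part of any library.

```python
def is_ascii(text):
    """Return True if every character's code is in the ASCII range 0..127."""
    return all(ord(ch) < 128 for ch in text)

# A letter, a lowercase letter, a digit and a control character (newline).
codes = [ord("A"), ord("a"), ord("0"), ord("\n")]
print(codes)  # [65, 97, 48, 10]

print(is_ascii("hello"))  # True
print(is_ascii("你好"))   # False: Chinese characters are outside ASCII

# Each ASCII character occupies exactly one byte when encoded.
ascii_bytes = "hello".encode("ascii")
print(len(ascii_bytes))  # 5
```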

Because ASCII covers so few characters, many countries and regions defined their own character sets that extend it. For example, GB2312, which is widely used in China, adds Chinese characters and represents each of them in two bytes.

These character sets are incompatible with each other. The same number may indicate different characters, which makes information exchange troublesome.
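
The incompatibility can be reproduced in a few lines. This sketch assumes Python 3 and uses Latin-1 purely as an example of a second character set; any other single-byte set would show the same effect with different characters.

```python
# Bytes that a GB2312-aware program would write for the text "你好".
raw = "你好".encode("gb2312")
print(raw)  # b'\xc4\xe3\xba\xc3'

# Read back with the right character set, the text survives.
as_gb2312 = raw.decode("gb2312")
print(as_gb2312)  # 你好

# Read with a different character set, the same four numbers
# turn into four unrelated characters.
as_latin1 = raw.decode("latin-1")
print(as_latin1)  # ÄãºÃ
```

Nothing in the bytes themselves says which character set produced them, so the reader has to know the character set in advance.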

Unicode is a character set that maps every character in the world to a unique number (a code point); for example, 0x0041 corresponds to the letter "A". Unicode is still being developed, and new versions keep adding characters.
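
The mapping between characters and code points can be inspected with the built-in `ord` and `chr` functions. This is a small sketch assuming Python 3, where strings are sequences of code points.

```python
# Character -> code point.
code_point = ord("A")
print(hex(code_point))  # 0x41

# Code point -> character.
letter = chr(0x0041)
print(letter)  # A

# The same functions work for any Unicode character, not only ASCII.
labels = ["U+%04X" % ord(ch) for ch in "你好"]
print(labels)  # ['U+4F60', 'U+597D']
```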

Storing characters represented by Unicode also requires an encoding method. UCS-2, for example, uses two bytes for each character. UTF-8 is another encoding of the Unicode character set. It is variable length: code points below 128 take one byte, identical to their ASCII values, and other code points take more bytes, up to 4 under the current definition (the original design allowed up to 6). Because of this, UTF-8 is highly compatible with ASCII: English text encoded in ASCII can be processed as UTF-8 without modification. UTF-8 is widely used.
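
The variable length of UTF-8 and its ASCII compatibility can both be observed directly. This is a minimal sketch assuming Python 3; the sample characters are arbitrary choices, written as escapes so the code does not depend on the file's own encoding.

```python
# One sample character from each UTF-8 length class.
samples = {
    "A": "A",                    # U+0041, in the ASCII range
    "e-acute": "\u00e9",         # U+00E9
    "ni": "\u4f60",              # U+4F60, the Chinese character 你
    "emoji": "\U0001F600",       # U+1F600, outside the 16-bit range
}

# Number of bytes each character needs in UTF-8.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(lengths)  # {'A': 1, 'e-acute': 2, 'ni': 3, 'emoji': 4}

# ASCII text produces exactly the same bytes in both encodings.
same_bytes = "hello".encode("ascii") == "hello".encode("utf-8")
print(same_bytes)  # True
```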

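Python has supported Unicode since version 2.0, and since 2.2 the interpreter can be built to store Unicode strings internally as either UCS-2 or UCS-4. In Python 2, the decode(char_set) method converts a byte string in another encoding to a Unicode string, and the encode(char_set) method converts a Unicode string to another encoding. A Unicode string here means a sequence of code points stored as UCS-2 or UCS-4.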

For example, assuming the string literal is stored as GB2312 bytes, ("你好").decode("gb2312") returns u'\u4f60\u597d'; that is, the Unicode code points of "你" (you) and "好" (good) are 0x4f60 and 0x597d respectively.
In turn, (u'\u4f60\u597d').encode("UTF-8") returns '\xe4\xbd\xa0\xe5\xa5\xbd', which is the UTF-8 encoding of "你好" (hello).
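
The same conversion can be tried in Python 3, where the two types are kept strictly apart: `bytes` has `decode` and `str` has `encode`. The sketch below is an adaptation, not the Python 2 code from the text, and it spells out the GB2312 input bytes explicitly so it does not depend on how the source file is saved.

```python
# GB2312-encoded input: two bytes per Chinese character.
gb_bytes = b"\xc4\xe3\xba\xc3"

# bytes -> str: interpret the bytes as GB2312 to obtain code points.
text = gb_bytes.decode("gb2312")
print(text)                           # 你好
print([hex(ord(ch)) for ch in text])  # ['0x4f60', '0x597d']

# str -> bytes: write the same code points out as UTF-8.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
```

Going through the Unicode string in the middle is what makes the transcoding work: neither byte sequence can be turned into the other without first knowing which characters it stands for.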
