Differences between ASCII, gb2312, Unicode, and UTF-8 in Python

Source: Internet
Author: User

ASCII is a character set that covers the uppercase and lowercase English letters, digits, punctuation, and control characters. Each character fits in one byte, with code values ranging from 0 to 127.
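As a quick check of the 0 to 127 range, here is a minimal sketch. It assumes Python 3, whereas the article itself targets Python 2, and `is_ascii` is a hypothetical helper name introduced only for illustration.

```python
def is_ascii(text):
    """Return True if every character has a code value in the range 0-127."""
    return all(ord(ch) < 128 for ch in text)

# Pure ASCII text encodes to exactly one byte per character.
ascii_bytes = "Hello, 123".encode("ascii")
ascii_ok = is_ascii("Hello, 123")

# Chinese characters lie outside the ASCII range, so encoding them fails.
non_ascii_ok = is_ascii("\u4f60\u597d")
try:
    "\u4f60\u597d".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
```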

Unicode is a character set; UTF-8 and UTF-16 are two of the encodings that store it as bytes. UTF-8 is variable length: the original design allowed up to 6 bytes per character, and the current standard (RFC 3629) limits it to 4. Characters with code values of 127 or below take one byte, identical to their ASCII encoding, so English text encoded as ASCII can be treated as UTF-8 without modification.
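The variable-length behaviour can be observed directly. The sketch below assumes Python 3 (the article targets Python 2), and the sample characters are chosen only for illustration.

```python
# One sample character from each UTF-8 length class, written as escapes
# so the code points are unambiguous.
samples = ["A", "\u00e9", "\u4f60", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# English text produces identical bytes under ASCII and UTF-8.
same_bytes = "plain English".encode("ascii") == "plain English".encode("utf-8")

# For contrast, UTF-16 spends two bytes even on an ASCII letter.
utf16_len = len("A".encode("utf-16-le"))
```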

Python has supported Unicode since version 2.0 (the examples here use Python 2). The decode(char_set) method converts a string in another encoding to Unicode, and the encode(char_set) method converts Unicode to another encoding.

For example, "你好".decode("gb2312") returns u'\u4f60\u597d' (assuming the string literal is stored as GB2312 bytes); that is, the Unicode code points of "你" (you) and "好" (good) are 0x4f60 and 0x597d respectively.
Then u'\u4f60\u597d'.encode("UTF-8") returns '\xe4\xbd\xa0\xe5\xa5\xbd', which is the UTF-8 encoding of "你好".
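The same GB2312 to UTF-8 conversion can be sketched in Python 3, which is an assumption here since the article uses Python 2. In Python 3 the bytes type plays the role of the Python 2 str, and str plays the role of unicode.

```python
# The GB2312 bytes for the two characters U+4F60 and U+597D.
gb2312_bytes = b"\xc4\xe3\xba\xc3"

# Step 1: decode the bytes into Unicode text.
text = gb2312_bytes.decode("gb2312")

# Step 2: encode the Unicode text into the target encoding.
utf8_bytes = text.encode("utf-8")

# The code points match the values quoted in the article.
code_points = [hex(ord(ch)) for ch in text]
```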

The key to using Unicode in Python: unicode is a class. The function unicode(s, "utf8") builds a unicode object from the string s encoded in UTF-8 (it can of course be another encoding), and the method u.encode("utf8") converts the unicode object u into a UTF-8 encoded string (again, it can be another encoding). Therefore, a program that deals with Unicode needs to do the following:

* When obtaining data (a string), use unicode(s, "utf8") to generate a unicode object.
* Use only unicode objects inside the program; write string constants in the form u"string".
* During output, convert the unicode object to whatever encoding is required with u.encode("some_encoding").
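The three rules above can be sketched as a small pipeline: decode at the input boundary, work on Unicode text in the middle, and encode at the output boundary. This is only an illustration under stated assumptions: it uses Python 3 rather than the Python 2 of the article, and `read_text`, `shout`, and `write_text` are hypothetical helper names.

```python
def read_text(raw, encoding="utf-8"):
    """Input boundary: turn incoming bytes into Unicode text."""
    return raw.decode(encoding)

def shout(text):
    """Program logic: operates on Unicode text only, never on raw bytes."""
    return text.upper() + "!"

def write_text(text, encoding="utf-8"):
    """Output boundary: turn Unicode text into bytes in the chosen encoding."""
    return text.encode(encoding)

incoming = b"\xe4\xbd\xa0\xe5\xa5\xbd, world"   # UTF-8 bytes from outside
text = read_text(incoming)
result = shout(text)

# The same Unicode text can be written out in any encoding that covers it.
outgoing_utf8 = write_text(result)
outgoing_gbk = write_text(result, "gbk")
```

Keeping every decode and encode call at the edges means the middle of the program never has to know which encoding the data arrived in.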

>>> unicode("你好", "utf8")
u'\u4f60\u597d'
>>> x = _
>>> type(x)
<type 'unicode'>
>>> type("你好")
<type 'str'>
>>> x.encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> x.encode("GBK")
'\xc4\xe3\xba\xc3'
>>> x.encode("gb2312")
'\xc4\xe3\xba\xc3'
>>> print x
你好
>>> print x.encode("utf8")
你好
>>> print x.encode("GBK")
???

The above is the test result (Ubuntu 6.06, locale set to UTF-8). Note the difference between type(x) and type("你好"): the first is a unicode object, the second a byte string. The encoded output shows that the UTF-8 bytes differ from the GBK bytes. Under a UTF-8 locale, print x is encoded according to the locale environment variables (my guess), whereas print x.encode("GBK") comes out garbled because the terminal interprets the GBK bytes as UTF-8.
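The garbled output can be reproduced without a terminal by decoding GBK bytes as if they were UTF-8, which is presumably what a UTF-8 terminal does with them. The sketch below assumes Python 3 (the session above is Python 2); `locale.getpreferredencoding` is queried only to show where an environment-dependent default comes from, and its value varies by system.

```python
import locale

text = "\u4f60\u597d"
gbk_bytes = text.encode("gbk")

# A strict UTF-8 decoder rejects the GBK bytes outright.
try:
    gbk_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A lenient decoder substitutes U+FFFD replacement characters instead,
# which is the kind of garbling a terminal shows.
garbled = gbk_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
round_trip = gbk_bytes.decode("gbk")

# The locale-derived default encoding on this machine (system dependent).
preferred = locale.getpreferredencoding(False)
```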

 

Reprint notice: this article is from http://1.vb.blog.163.com/blog/static/104546220071113105047729/

 

