GBK, UTF-8 and Unicode encoding issues in Python

Last Update:2015-07-19 Source: Internet

Author: User

Encoding problems have always been a headache when using Python 2. Almost all console input and output, file I/O, and HTTP operations run into encoding issues, producing errors such as the following:



UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

What on earth does this mean? It is tempting to sprinkle encode() and decode() calls around until the program happens to run correctly, only to have it fail again the next time it meets non-ASCII text.
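As a rough illustration of where such a message comes from, the sketch below decodes GBK bytes with the ASCII codec. The sample data (ten ASCII bytes followed by two Chinese characters) is hypothetical and chosen only so that the failing byte lands at position 10; the code uses u"" and b"" literals so that it should run on both Python 2.7 and Python 3.

```python
# Hypothetical data: ten ASCII bytes followed by the GBK bytes of two
# Chinese characters (written as escapes to avoid source-encoding issues).
text = u"\u4f60\u597d"
raw = b"0123456789" + text.encode("gbk")   # raw bytes, not text

try:
    raw.decode("ascii")    # ASCII only covers byte values 0x00-0x7f
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Prints the codec error, including the offending byte and its position.
print(error)
```

In Python 2 the same decoding step often happens implicitly, for example when a str and a unicode object are concatenated, which is why the error can appear without any explicit decode() call.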

So what exactly is a string in Python 2.x?

Basic Coding Knowledge

Before looking at the nature of strings in Python, we need to understand how ASCII, GBK, UTF-8, and Unicode relate to one another.
Any string is stored as a sequence of bytes. ASCII is the most classic encoding: it treats each byte in the sequence as one character and can represent 128 different characters, including the Arabic numerals and the English letters. Obviously, Chinese characters cannot be expressed in ASCII.
To let computers display and process Chinese characters, GBK (an extension of GB2312) was developed. It is an ASCII-compatible variable-length encoding (1 to 2 bytes per character): the basic 128 characters still take one byte each, while characters such as Chinese ones are expressed in two bytes.
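The one-byte versus two-byte behaviour of GBK can be checked with a short sketch. The sample characters are arbitrary choices for illustration, and the u"" literals are used so the snippet should run on both Python 2.7 and Python 3.

```python
ascii_char = u"a"
chinese_char = u"\u4f60"   # a single Chinese character, chosen arbitrarily

gbk_ascii = ascii_char.encode("gbk")
gbk_chinese = chinese_char.encode("gbk")

print(len(gbk_ascii))      # number of bytes for the ASCII letter
print(len(gbk_chinese))    # number of bytes for the Chinese character

# ASCII compatibility: the GBK bytes of "a" equal its ASCII bytes.
print(gbk_ascii == ascii_char.encode("ascii"))
```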

Like GBK, UTF-8 is an ASCII-compatible variable-length encoding. Its characters vary in length, so it can represent almost all of the world's writing systems. For details, see Wikipedia: http://zh.wikipedia.org/wiki/UTF-8
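A minimal sketch of UTF-8's variable length follows. The four sample characters are illustrative picks (an ASCII letter, an accented Latin letter, a Chinese character, and an emoji), not taken from the article.

```python
# One sample character from each UTF-8 length class (illustrative choices).
samples = [u"a", u"\u00e9", u"\u4f60", u"\U0001F600"]

# Encode each character on its own and count the resulting bytes.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(utf8_lengths)   # [1, 2, 3, 4]
```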

Unicode assigns every character a unique number, its code point. In the simple fixed-length form described here (like ASCII, but with every 2 bytes treated as one character), 0x61 for 'a' in ASCII becomes 0x0061 in Unicode. Strictly speaking, two bytes cover only the most common characters, and code points beyond 0xFFFF need more. Unicode can map all text, and even variant forms of the same character are given their own distinct codes.
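The code point idea can be sketched with ord(), and the two-byte form with the "utf-16-be" codec, which stands in here for the fixed two-byte layout described above (an assumption made for illustration; it only matches that layout for code points up to 0xFFFF).

```python
latin_a = u"a"
chinese = u"\u4f60"   # an arbitrary Chinese character

# Code points are plain integers, independent of any byte encoding.
print(hex(ord(latin_a)))   # 0x61
print(hex(ord(chinese)))   # 0x4f60

# Two-byte big-endian form: 'a' becomes 0x00 0x61.
two_byte_a = latin_a.encode("utf-16-be")
two_byte_chinese = chinese.encode("utf-16-be")
print(len(two_byte_a), len(two_byte_chinese))
```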

Because strings stored this way take up more space, Unicode generally serves only as the in-memory form of text; concrete storage (files, web pages, and so on) uses an external encoding such as UTF-8 or GBK.
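To make the split between in-memory text and external encoding concrete, here is a small sketch that writes the same text to temporary files in two encodings and reads it back. The use of io.open and tempfile is a choice made for this example so that it should behave the same on Python 2.7 and Python 3; the variable names are hypothetical.

```python
import io
import os
import tempfile

text = u"\u4f60\u597d"   # two Chinese characters held as text in memory
sizes = {}
round_trips = {}

for encoding in ("utf-8", "gbk"):
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # The external encoding is chosen only at the file boundary.
        with io.open(path, "w", encoding=encoding) as f:
            f.write(text)
        sizes[encoding] = os.path.getsize(path)
        with io.open(path, "r", encoding=encoding) as f:
            round_trips[encoding] = (f.read() == text)
    finally:
        os.remove(path)

print(sizes)         # bytes on disk differ per encoding
print(round_trips)   # the text read back is identical in both cases
```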

The nature of strings in Python 2.x

Python 2 actually has two string types, str and unicode, both of which derive from basestring. The differences are as follows:

String type	Literal	In-memory representation	len()	Meaning of len()
str	s = "呵呵"	Exactly the bytes in the source file	len(s) == 6 if the source file is UTF-8 encoded	Number of bytes
unicode	s = u"呵呵"	Unicode	len(s) == 2	Number of characters

A str is essentially a raw sequence of bytes: its content depends entirely on the encoding of the source file (or of the retrieved web page). Python itself does not know which encoding a str uses. This also explains why we need to declare the file's encoding at the top of a Python source file, for example:



# encoding: utf-8

It also explains why len() on a str returns only the number of bytes it occupies in memory, not the number of characters.
In contrast to str, unicode is the real string type. Python knows exactly how it is encoded, so it can reliably report the actual number of characters.

String encoding conversions: encode() and decode()

The most commonly used encoding conversion functions in Python are encode() and decode(). In essence, they convert between unicode and str.
Specifically:
encode(encoding): converts unicode to str, encoded with the given encoding;
decode(encoding): converts str to unicode, where the str is assumed to be in the given encoding.

Let's look at an example:

# encoding: utf-8
s = "你好"  # The entire file is UTF-8 encoded, so this str holds UTF-8 bytes
u = s.decode("utf-8")  # Convert the UTF-8 str to unicode
g = u.encode("gbk")  # Convert unicode to a str encoded as GBK
print type(s), "len=", len(s)  # Output: <type 'str'> len= 6; UTF-8 uses 3 bytes per Chinese character
print type(u), "len=", len(u)  # Output: <type 'unicode'> len= 2; unicode counts characters
print type(g), "len=", len(g)  # Output: <type 'str'> len= 4; GBK uses 2 bytes per Chinese character
print s  # Garbled in a GBK/ANSI environment (e.g. Windows), because screen output is forcibly interpreted as GBK; displays normally under Linux
print g  # Displays "你好" under Windows; goes wrong under Linux (UTF-8 environment) for the same reason
g.decode("utf-8")  # Raises UnicodeDecodeError: g holds GBK bytes, not UTF-8

Running this under Windows 7 (Chinese) gives the following output:



<type 'str'> len= 6
<type 'unicode'> len= 2
<type 'str'> len= 4
浣犲ソ
你好

Traceback (most recent call last):
  File "C:/Users/Sunicy/Desktop/encode.py", line 15, in <module>
    g.decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
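The failure in the traceback comes from decoding GBK bytes with the UTF-8 codec. The sketch below reproduces that situation and shows two possible responses: decoding with the codec that actually produced the bytes, or, when the real encoding is unknown, passing the standard "replace" error handler so that undecodable bytes become U+FFFD. The variable names are hypothetical, and the snippet is written to run on both Python 2.7 and Python 3.

```python
text = u"\u4f60\u597d"
gbk_bytes = text.encode("gbk")   # bytes produced with GBK

# Decoding with the wrong codec fails, as in the traceback above.
try:
    gbk_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the codec that produced the bytes restores the text.
recovered = gbk_bytes.decode("gbk")

# Lossy fallback: invalid bytes are replaced with U+FFFD instead of raising.
lossy = gbk_bytes.decode("utf-8", "replace")

print(failed, recovered == text)
```

The lossy variant never raises, but the original characters are gone, so it only suits cases such as logging where an approximate rendering is acceptable.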

Determine if a variable is a string

In Python, we check whether a variable is of a given type with isinstance(variable, type), for example



isinstance(1.2,float)

The return value is True

Can we then test whether a variable is a string with the following?



isinstance(s, str)

The answer is no.
As we now know, unicode is a string type as well as str, so the code above returns False when it meets a unicode string.
The obvious improvement is to check for both str and unicode:



isinstance(s, str) or isinstance(s, unicode)

This works, but it is a little clumsy. Since str and unicode both derive from basestring, the most reliable approach is to use basestring as the type:



isinstance(s,basestring)

Here is a set of examples:

isinstance("aaa", str)  # -> True
isinstance({}, dict)  # -> True
isinstance([1,], list)  # -> True
isinstance("aaa", list)  # -> False
isinstance(u"你", str)  # -> False
isinstance(u"你好", basestring)  # -> True
isinstance("aaa", basestring)  # -> True
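The article targets Python 2, where basestring exists. As a hedged side sketch, the check can be wrapped in a helper that also runs on Python 3, where basestring was removed and str is the only text type. The names string_types and is_string are hypothetical, introduced only for this example.

```python
try:
    string_types = (basestring,)   # Python 2: common base of str and unicode
except NameError:
    string_types = (str,)          # Python 3: basestring no longer exists


def is_string(value):
    """Return True if value is a text string on the running Python version."""
    return isinstance(value, string_types)


print(is_string("aaa"))            # True
print(is_string(u"\u4f60\u597d"))  # True
print(is_string([1]))              # False
```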

Summary

Unicode is a unified character encoding that supports all text, but it is generally used only as the internal representation of text; files, web pages (which are also files), and screen input and output need a specific external encoding such as GBK or UTF-8.
encode and decode are both defined relative to unicode: encode is the unicode -> str process, and decode is the str -> unicode process.
unicode and str are siblings that both derive from basestring, so use isinstance(s, basestring) to check whether s is a string.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.