GBK, UTF-8 and Unicode encoding issues in Python

Source: Internet
Author: User


Encoding problems have always been a sore point when using Python 2. Almost all console input and output, file I/O, and HTTP operations can run into encoding errors such as the following:


UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)


What on earth is this?! Sometimes we throw a pile of encode() and decode() calls at the problem until the program happens to run correctly, but the next time non-ASCII data shows up, it fails all over again.



So what exactly is a string in Python 2.x?


Basic Coding Knowledge


Before we look at the nature of strings in Python, we need to understand how ASCII, GBK, UTF-8, and Unicode relate to one another.
On disk or on the wire, any string is a sequence of bytes. ASCII is the classic encoding: it interprets each byte in the sequence as one character and can represent 128 different characters, including the Arabic numerals and the English letters. Chinese characters obviously cannot be expressed in ASCII.
To let computers display and process Chinese characters, the industrious and simple Chinese people developed the GBK encoding (an extension of GB2312). It is an ASCII-compatible variable-length encoding (1 to 2 bytes per character): the basic 128 characters still take one byte each, while a Chinese character is represented with two bytes.
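As a minimal sketch of the point above (not taken from the original article), the snippet below encodes an ASCII letter and one Chinese character with Python's built-in `gbk` codec. The character 你 (written as the escape `\u4f60`) and all variable names are illustrative assumptions. The code is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: byte lengths of characters under GBK and ASCII.

ascii_letter = u"a"
chinese_char = u"\u4f60"  # the character 你, chosen only as an example

gbk_letter = ascii_letter.encode("gbk")   # one byte, same value as in ASCII
gbk_chinese = chinese_char.encode("gbk")  # two bytes: 0xC4 0xE3

print("%d byte for the letter, %d bytes for the Chinese character"
      % (len(gbk_letter), len(gbk_chinese)))

# ASCII has no code for a Chinese character, so encoding fails.
try:
    chinese_char.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print("encodable as ASCII: %s" % ascii_ok)
```

The leading byte 0xC4 here is the same byte that shows up in the UnicodeDecodeError message quoted at the top of the article.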



Like GBK, UTF-8 is a variable-length encoding that is compatible with ASCII. A character takes from 1 to 4 bytes, so UTF-8 can represent practically all of the world's scripts. For details, see the Wikipedia article: http://zh.wikipedia.org/wiki/UTF-8
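To make the variable length concrete, here is a small sketch that encodes four sample characters as UTF-8 and records how many bytes each one needs. The sample characters and the names `samples` and `utf8_lengths` are assumptions made for illustration, not part of the original article.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: UTF-8 uses a different number of bytes per character.

samples = [
    (u"a", "ASCII letter"),
    (u"\u00e9", "Latin letter e with acute accent"),
    (u"\u4f60", "Chinese character"),
    (u"\U0001f600", "emoji outside the Basic Multilingual Plane"),
]

utf8_lengths = []
for char, description in samples:
    encoded = char.encode("utf-8")
    utf8_lengths.append(len(encoded))
    print("%-45s %d byte(s)" % (description, len(encoded)))
```

The ASCII letter keeps its single-byte form, which is what ASCII compatibility means in practice.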



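Unicode, as described here, is a fixed-length encoding (like ASCII), except that every 2 bytes are treated as one character: where ASCII uses 0x61 for 'a', Unicode uses 0x0061. It can map the characters of all scripts, and even variant forms of the same character, such as 强 and 強, each get their own unique code. Strictly speaking, Unicode assigns every character a number (a code point), and the 2-byte form is only one way of storing it; characters outside the Basic Multilingual Plane need more than 2 bytes.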
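The following sketch illustrates the paragraph above. It uses UTF-16 big-endian as a stand-in for the 2-byte form the text describes (an assumption of this example), and the two variant forms of the character meaning "strong" are written as escapes. Variable names are hypothetical.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: every character has its own code point.

code_a = ord(u"a")                     # 0x61, the same number as in ASCII
two_byte_a = u"a".encode("utf-16-be")  # the 2-byte form 0x00 0x61

simplified = u"\u5f3a"   # 强
traditional = u"\u5f37"  # 強
variants_differ = ord(simplified) != ord(traditional)

print("code point of 'a': %s" % hex(code_a))
print("variants have distinct code points: %s" % variants_differ)
```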



Because text stored in this fixed-width Unicode form takes up a lot of space, Unicode is generally used only as the in-memory representation of text; for actual storage and transmission (files, web pages, and so on) the text is converted to an external encoding such as UTF-8 or GBK.
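One way to picture this split between the in-memory form and the external encoding is to write the same text to disk twice with different encodings. The sketch below does that with `io.open`, which accepts an `encoding` argument on both Python 2.7 and Python 3. The helper name `save_and_reload` and the sample text are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: one in-memory text, two on-disk encodings.
import io
import os
import tempfile

text = u"\u4f60\u597d"  # 你好, held as Unicode text in memory


def save_and_reload(content, encoding):
    """Write content with the given encoding; return (raw bytes, decoded text)."""
    handle, path = tempfile.mkstemp()
    os.close(handle)
    try:
        with io.open(path, "w", encoding=encoding) as out:
            out.write(content)
        with io.open(path, "rb") as raw:
            raw_bytes = raw.read()
        with io.open(path, "r", encoding=encoding) as src:
            reloaded = src.read()
    finally:
        os.remove(path)
    return raw_bytes, reloaded


utf8_bytes, utf8_text = save_and_reload(text, "utf-8")
gbk_bytes, gbk_text = save_and_reload(text, "gbk")

print("%d bytes as UTF-8, %d bytes as GBK" % (len(utf8_bytes), len(gbk_bytes)))
```

The bytes on disk differ between the two files, but reading each file back with the encoding it was written in yields the same text.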


The nature of strings in Python 2.x


There are actually two string types in Python 2: the str type and the unicode type, both of which are subclasses of basestring. The differences are as follows:


| String type | Literal syntax | In-memory representation | len() | Meaning of len() |
|---|---|---|---|---|
| str | s = "呵呵" | Exactly the bytes in the source file, a lump of binary data | len(s) = 6 if the source file is UTF-8 encoded | Number of bytes |
| unicode | s = u"呵呵" | Unicode encoding | len(s) = 2 | Number of characters |


A str is, in essence, just a sequence of raw bytes; how those bytes are encoded depends on the source file (or the web page they were fetched from). Python itself does not know what encoding a str holds. This also explains why we need to declare the file's encoding at the top of a Python source file, for example:




# encoding: utf-8


It also explains why len() on a str returns the number of bytes it occupies in memory, not the number of characters.
Compared with str, unicode is the real string type. Python knows exactly how it is encoded, so it can reliably report the actual number of characters in the string.
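The claim that a byte string carries no encoding of its own can be sketched as follows: the same two bytes turn into different text depending on which codec is used to decode them. The byte values and the choice of Latin-1 as the second codec are illustrative assumptions; the snippet is written to run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the same bytes mean different text under different codecs.

raw = b"\xc4\xe3"  # two bytes with no encoding information attached

as_gbk = raw.decode("gbk")         # one character: 你
as_latin1 = raw.decode("latin-1")  # two characters: Ä and ã

print("%d bytes" % len(raw))
print("%d character(s) when read as GBK" % len(as_gbk))
print("%d character(s) when read as Latin-1" % len(as_latin1))
```

Only the decoded values have a meaningful character count, which matches the distinction between byte count and character count described above.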


String encoding conversions: encode() and decode()


The most commonly used encoding conversion methods in Python are encode() and decode(). In essence, they convert between unicode and str.
Specifically:
encode(encoding): converts a unicode string to a str, encoded with the given encoding;
decode(encoding): converts a str to unicode, where the str is assumed to be encoded with the given encoding.
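Taken together, the two definitions imply that converting bytes from one external encoding to another goes through Unicode in the middle. The sketch below wraps that in a hypothetical helper named `transcode` (not a standard library function); it should run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: re-encoding bytes always passes through Unicode.


def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string (hypothetical helper for illustration)."""
    text = data.decode(source_encoding)  # bytes -> Unicode
    return text.encode(target_encoding)  # Unicode -> bytes


utf8_data = u"\u4f60\u597d".encode("utf-8")       # 你好 as UTF-8, 6 bytes
gbk_data = transcode(utf8_data, "utf-8", "gbk")   # same text as GBK, 4 bytes
round_trip = transcode(gbk_data, "gbk", "utf-8")  # and back again

print("UTF-8: %d bytes, GBK: %d bytes" % (len(utf8_data), len(gbk_data)))
```

Passing the wrong `source_encoding` is exactly the mistake that produces a UnicodeDecodeError or garbled text.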



Let's look at an example:


# encoding: utf-8

s = "你好"               # the whole file is UTF-8 encoded, so this str is UTF-8 too
u = s.decode("utf-8")    # convert the UTF-8 str to unicode
g = u.encode('GBK')      # convert unicode to a str encoded as GBK

print type(s), "len=", len(s)
# Output: <type 'str'> len= 6 (UTF-8 uses 3 bytes per Chinese character)
print type(u), "len=", len(u)
# Output: <type 'unicode'> len= 2 (unicode counts characters)
print type(g), "len=", len(g)
# Output: <type 'str'> len= 4 (GBK uses 2 bytes per Chinese character)

print s
# In a GBK/ANSI environment (e.g. Windows) this prints garbled text,
# because the console interprets the bytes as GBK; under Linux it displays correctly
print g
# Prints "你好" under Windows;
# under Linux (UTF-8 environment) the output is wrong, for the same reason

g.decode('utf-8')
# Raises UnicodeDecodeError: g holds GBK bytes, not UTF-8


Running it under Windows 7 (Chinese locale) produces the following:




<type 'str'> len= 6
<type 'unicode'> len= 2
<type 'str'> len= 4
浣犲ソ
你好

Traceback (most recent call last):
  File "C:/Users/Sunicy/Desktop/encode.py", line 15, in <module>
    g.decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
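
The traceback above comes from decoding GBK bytes as UTF-8. As a hedged sketch of how such a failure might be handled, the snippet below catches the exception, shows the lenient `"replace"` error handler that the standard codecs accept, and then decodes with the correct codec. Variable names are illustrative, and the code is written to run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: what happens when bytes are decoded with the wrong codec.

gbk_data = u"\u4f60\u597d".encode("gbk")  # GBK bytes, like g in the example

try:
    gbk_data.decode("utf-8")              # wrong codec for these bytes
    strict_failed = False
except UnicodeDecodeError as error:
    strict_failed = True
    print("strict decode failed: %s" % error.reason)

# "replace" never raises; undecodable bytes become U+FFFD instead.
lenient = gbk_data.decode("utf-8", "replace")

# Decoding with the encoding the bytes were written in recovers the text.
correct = gbk_data.decode("gbk")
```

The lenient form avoids the crash but loses information, so it is a fallback for display purposes rather than a fix; the real remedy is knowing which encoding the bytes use.
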


Determine if a variable is a string


We know that Python uses the isinstance(variable, type) function to check whether a variable is of a given type, for example:




isinstance(1.2,float)


The return value is True



So can we check whether a variable is a string like this?


isinstance(s, str)



The answer is no.
As we now know, unicode is a string type as well as str, so the code above returns False when it is given a unicode string.
The obvious improvement is to check for both str and unicode:




isinstance(s, str) or isinstance(s, unicode)


This works, but it is a little clumsy. Since str and unicode both derive from basestring, the safest approach is to use basestring as the type:




isinstance(s,basestring)


Here is a set of examples:


isinstance("aaa", str)            # -> True
isinstance({}, dict)              # -> True
isinstance([1,], list)            # -> True
isinstance("aaa", list)           # -> False
isinstance(u"你", str)             # -> False
isinstance(u"你好", basestring)    # -> True
isinstance("aaa", basestring)     # -> True
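
The basestring check can be wrapped in a small helper. The sketch below is an assumption-laden illustration rather than the article's own code: the name `is_string` is hypothetical, and because basestring exists only in Python 2, the fallback to `(str, bytes)` is a stand-in chosen so that the same file also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: one string check for both byte strings and Unicode text.

try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str, bytes)   # Python 3 stand-in (assumption of this sketch)


def is_string(value):
    """Return True for byte strings and Unicode strings alike."""
    return isinstance(value, string_types)


checks = [
    is_string("aaa"),
    is_string(u"\u4f60\u597d"),
    is_string(b"aaa"),
    is_string(["aaa"]),
    is_string(1.2),
]
print(checks)
```
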


Summary
    1. Unicode is a unified encoding that supports all scripts, but it is generally used only as the internal representation of text; files, web pages (which are also files), screen input and output, and so on need a specific external encoding such as GBK or UTF-8.
    2. encode and decode are named from Unicode's point of view: encode is the unicode -> str direction, and decode is the str -> unicode direction.
    3. unicode and str are siblings that both derive from basestring, so use isinstance(s, basestring) to check whether s is a string.
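
The three points can be combined into a pair of boundary helpers: decode bytes to Unicode when data comes in, and encode back to bytes when it goes out. This is a minimal sketch under stated assumptions; `to_unicode` and `to_bytes` are hypothetical names, the UTF-8 default is an arbitrary choice, and the check relies on `bytes` being an alias of str in Python 2.7 so the code runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: normalize text at the program's input and output boundaries.


def to_unicode(value, encoding="utf-8"):
    """Return value as Unicode text, decoding byte strings if necessary."""
    if isinstance(value, bytes):  # on Python 2, bytes is the str type
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding Unicode text if necessary."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


incoming = b"\xc4\xe3\xba\xc3"            # GBK bytes, e.g. read from a file
text = to_unicode(incoming, "gbk")        # work with Unicode internally
outgoing = to_bytes(text, "utf-8")        # choose the encoding on the way out

print("%d characters, %d bytes written" % (len(text), len(outgoing)))
```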

