Python2 and Python3 encoding and decoding

Source: Internet
Author: User
Tags string methods

Today let's get a complete picture of how encoding really works in Python, covering both Py2 and Py3. Some readers may ask: Py3 is the general trend, so is it really necessary to understand Py2's headache-inducing encoding? The answer is yes. Py2 is still a mainstay in production.

What is encoding?

The basic concept is simple. First, we start with a piece of information, a message, that exists in a form humans can understand. I will call this form "plaintext" (plain text). For people who speak English, English words printed on paper or displayed on a screen count as plaintext.

Second, we need to be able to turn the plaintext message into some other representation, and to turn that encoded text back into plaintext. The conversion from plaintext to encoded text is called "encoding", and the conversion from encoded text back to plaintext is called "decoding".
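
As a minimal sketch of this round trip in Python 3 (the variable names are illustrative and not from the original article): the same text encodes to different bytes under UTF-8 and GBK, and each byte sequence decodes back to the text with the matching codec.

```python
# Round trip: text -> bytes (encode) -> text (decode).
plaintext = "苑昊"

utf8_bytes = plaintext.encode("utf-8")   # encoding: text to bytes
gbk_bytes = plaintext.encode("gbk")      # same text, different byte sequence

print(utf8_bytes)  # b'\xe8\x8b\x91\xe6\x98\x8a'
print(gbk_bytes)   # b'\xd4\xb7\xea\xbb'

# Decoding with the codec that produced the bytes restores the text.
restored_from_utf8 = utf8_bytes.decode("utf-8")
restored_from_gbk = gbk_bytes.decode("gbk")
print(restored_from_utf8 == plaintext, restored_from_gbk == plaintext)  # True True
```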

...

Py2 encoding

str and unicode

Both str and unicode are subclasses of basestring. Strictly speaking, str is a byte string: the sequence of bytes produced by encoding Unicode text (with UTF-8, GBK, etc.). Calling len() on the UTF-8 encoded str '苑' returns 3, because the UTF-8 encoded '苑' == '\xe8\x8b\x91'.

unicode is the real string. It is obtained by decoding a byte string str with the correct character encoding, and len(u'苑') == 1.
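
The same length difference can be observed in Python 3, where bytes plays the role of the Py2 str and str plays the role of the Py2 unicode. A small sketch, with illustrative variable names:

```python
char = "苑"                      # text: one Unicode character
encoded = char.encode("utf-8")   # bytes: the UTF-8 encoding of that character

print(len(char))       # 1
print(len(encoded))    # 3
print(encoded)         # b'\xe8\x8b\x91'
print(hex(ord(char)))  # 0x82d1
```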

In Py2, bytes is simply another name for str.

The most distinctive feature of Py2 encoding is that Python 2 automatically decodes bytes data into a unicode string when the two are mixed, using the default ASCII codec.

So in Py2 we can concatenate a byte string with a unicode string.

    #coding:utf8

    print '苑昊'          # 苑昊
    print repr('苑昊')    # '\xe8\x8b\x91\xe6\x98\x8a'

    print (u"hello" + "yuan")   # helloyuan

    # print (u'苑昊' + '最帅')
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6
    # in position 0: ordinal not in range(128)

Two questions:

1 print '苑昊': what is actually stored is '\xe8\x8b\x91\xe6\x98\x8a', so why is the plaintext 苑昊 displayed?

2 Why can byte strings and unicode strings be concatenated at all?

This is the hateful UnicodeError. When your code mixes unicode and byte strings, every conversion succeeds as long as the data is all ASCII. Once a non-ASCII character sneaks into your program, the default decoding fails and raises UnicodeDecodeError.

Python 2 quietly hides the byte-to-unicode conversion, which makes handling ASCII easier. The price you pay is that it fails when handling non-ASCII data.
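
The implicit conversion can be imitated in Python 3 to see why it breaks. The sketch below is only an emulation: py2_style_concat is a hypothetical helper that decodes the byte string with the ASCII codec before concatenating, which is what Py2 does behind the scenes.

```python
def py2_style_concat(text, raw):
    """Hypothetical helper: mimic Py2's `unicode + str` by decoding
    the byte string with the ASCII codec before concatenating."""
    return text + raw.decode("ascii")

# Pure-ASCII bytes decode cleanly, so the concatenation works.
ok = py2_style_concat("hello", b"yuan")
print(ok)  # helloyuan

# Non-ASCII bytes make the implicit ASCII decode fail.
try:
    py2_style_concat("苑昊", "最帅".encode("utf-8"))
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc.encoding, hex(exc.object[exc.start]))  # ascii 0xe6
```

The failing byte, 0xe6, is the first byte of the UTF-8 encoding of '最', which matches the error message quoted in the Py2 example.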

  

Now look at encode() and decode(), two instance methods of basestring. Once you understand the difference between str and unicode, these two methods are no longer confusing:

    #coding:utf8

    u = u'苑'
    print repr(u)    # u'\u82d1'
    # print str(u)   # UnicodeEncodeError

    s = u.encode('utf8')
    print repr(s)    # '\xe8\x8b\x91'
    print str(s)     # 苑

    u2 = s.decode('utf8')
    print repr(u2)   # u'\u82d1'

Py3 encoding

Python 3 renamed the unicode type to str, and the old str type was replaced by bytes.

Like Python 2, Python 3 has two types, one for Unicode text and one for bytes, but they have different names.

A plain text literal now gives you a "str", which stores Unicode, while the "bytes" type stores a byte string. You can create a byte string with the b prefix.

The most important new feature of Python 3 is probably the clearer distinction between text and binary data. Text is always Unicode, represented by the str type, and binary data is represented by the bytes type. Python 3 does not mix str and bytes in any implicit way, which makes the distinction between them especially clear. You cannot concatenate a string with a bytes object, search for a string inside a bytes object (or vice versa), or pass a string to a function that expects bytes (or vice versa). This is a good thing.

The biggest change to Unicode support in Python 3 is that byte strings are never decoded automatically. If you try to concatenate a byte string with a Unicode string, you get an error no matter what the contents are.

Everything that Python 2 handled implicitly raises an error in Python 3.

    # print('alvin' + u'yuan')   # byte string + unicode in Py2: alvinyuan
    print(b'alvin' + 'yuan')     # byte string + unicode in Py3: TypeError, can't concat bytes to str
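
For completeness, here is a small Python 3 sketch of the explicit conversions that replace the implicit one. The variable names are illustrative, and ASCII is used only because this particular data is pure ASCII.

```python
name = b"alvin"   # bytes
suffix = "yuan"   # str (Unicode text)

# Mixing the two types raises TypeError, whatever the contents are.
try:
    name + suffix
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Converting explicitly, in either direction, works.
as_text = name.decode("ascii") + suffix     # str + str
as_bytes = name + suffix.encode("ascii")    # bytes + bytes
print(mixed_ok, as_text, as_bytes)  # False alvinyuan b'alvinyuan'
```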

Conversion:

    import json

    s = '苑昊'
    print(json.dumps(s))        # "\u82d1\u660a"

    b1 = s.encode('utf8')
    print(b1, type(b1))         # b'\xe8\x8b\x91\xe6\x98\x8a' <class 'bytes'>
    print(b1.decode('utf8'))    # 苑昊
    # print(b1.decode('gbk'))   # 鑻戞槉

    b2 = s.encode('gbk')
    print(b2, type(b2))         # b'\xd4\xb7\xea\xbb' <class 'bytes'>
    print(b2.decode('gbk'))     # 苑昊

Note: in both Py2 and Py3, Unicode data corresponds directly to the plaintext, so printing Unicode data displays the corresponding plaintext (for English and Chinese alike).
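
The correspondence between characters and Unicode code points can be inspected directly. A short Python 3 sketch using the built-in ord() and the standard json module:

```python
import json

text = "苑昊"

# Each character corresponds to one code point.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)  # ['0x82d1', '0x660a']

# The \u escape and the literal character are the same string.
same = "\u82d1\u660a" == text
print(same)  # True

# json.dumps escapes non-ASCII characters by default.
escaped = json.dumps(text)
print(escaped)  # "\u82d1\u660a"
```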

Encoding implementation

To understand encoding we need to grasp the whole process. For example, when we write a .py file in PyCharm, how exactly is the data converted between saving and running?

Before answering that, we need to settle one thing first: the default encoding.

Default encoding

What is the default encoding? It is the encoding the interpreter falls back on when none is specified, for example when it reads your source code: ASCII in Py2 and UTF-8 in Py3. You can check it with sys.getdefaultencoding().
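
A quick check, assuming the snippet is run under Python 3:

```python
import sys

default_encoding = sys.getdefaultencoding()
print(default_encoding)  # utf-8 on Python 3 (Python 2 reports ascii)

# str.encode() with no argument uses UTF-8 in Python 3.
implicit = "苑".encode()
explicit = "苑".encode("utf-8")
print(implicit == explicit)  # True
```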

    #-*- coding: UTF-8 -*-

What does this declaration do? At first we only know that in Py2, without such a line, the program reports an error as soon as Chinese appears in it. The reason is that Py2 defaults to ASCII, which cannot represent Chinese or other such special characters.

The declaration tells the Python 2.7 interpreter (whose default is ASCII) to treat the content of hello.py below the declaration as UTF-8, that is, to decode the bytes of the file as UTF-8 when reading it, before the code is finally turned into 0s and 1s for the machine to execute.

Note that hello.py itself is saved with a specific encoding, such as UTF-8 or GBK.

The declared encoding must be the same as the encoding actually used to save the file; otherwise the code is very likely to be parsed incorrectly. IDEs today usually handle this automatically: when you change the declaration, they save the file in the newly declared encoding. With a plain text editor you need to be careful, because the encoding used for saving depends on the editor's default (which is adjustable).
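
The effect of the declaration can be observed in Python 3 with the standard tokenize.detect_encoding() function, which reads the declaration from the first lines of a source file. This is a sketch only: source_encoding is a hypothetical wrapper, and the two byte strings stand in for files saved to disk.

```python
import io
import tokenize

def source_encoding(source_bytes):
    """Return the encoding Python 3 would pick for these source bytes,
    based on the declaration (or its UTF-8 default when there is none)."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# A file saved as GBK that also declares GBK: declaration and bytes agree.
declared_gbk = "# -*- coding: gbk -*-\nprint('苑昊')\n".encode("gbk")
# The same statement saved as GBK but with no declaration.
undeclared_gbk = "print('苑昊')\n".encode("gbk")

print(source_encoding(declared_gbk))  # gbk

# Without a declaration the bytes are assumed to be UTF-8, which these are not.
try:
    source_encoding(undeclared_gbk)
    mismatch_detected = False
except SyntaxError:
    mismatch_detected = True
print(mismatch_detected)  # True
```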

File save and Execute process

We said that strings are stored in memory as Unicode data, but when does our data actually get into memory? Let's walk through the process together.

For example, we create a hello.py file in PyCharm (Py3.5):

    print('hello 苑昊')

Is our data in memory at this point? No. PyCharm has saved it to the hard drive as binary data, using its default file encoding. So note that when you click Run, the file first has to be opened and its contents loaded into memory. The interpreter then decodes those contents with its default encoding, UTF-8, and interprets them line by line, and the string literal ends up in memory as Unicode data.

Therefore, errors occur when the encoding used to save the file is inconsistent with the encoding the interpreter uses to read it.
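
The save and load steps can be sketched in Python 3 with a temporary file. The code below only imitates what an editor and the interpreter do; it is not PyCharm's or CPython's actual implementation.

```python
import os
import tempfile

source_text = "print('hello 苑昊')\n"

# Saving: the editor encodes the text and writes bytes to disk.
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as f:
    f.write(source_text.encode("utf-8"))
    path = f.name

try:
    # On disk there are only bytes.
    with open(path, "rb") as f:
        raw = f.read()
    # Loading: the bytes are decoded back into text in memory.
    loaded = raw.decode("utf-8")
finally:
    os.remove(path)

print(type(raw).__name__, type(loaded).__name__)  # bytes str
print(loaded == source_text)  # True
```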

What did print do?

In Python 2, print is a statement; in Python 3 it becomes a function.

Print statement

In Python 2, the simplest form of the print statement is print A, which is equivalent to executing:

    sys.stdout.write(str(A) + '\n')

If you pass extra arguments separated by commas, each argument is passed to str(), and the results are printed with a space between them.

    # print A, B, C is equivalent to:
    sys.stdout.write(' '.join(map(str, [A, B, C])) + '\n')

    # If you add a comma at the end of the print statement,
    # the newline (\n) is no longer added, so print A, is equivalent to:
    sys.stdout.write(str(A))

Print function

    import sys

    def print(*objects, sep=None, end=None, file=None, flush=False):
        """A Python translation of the C code for builtins.print().
        """
        if sep is None:
            sep = ' '
        if end is None:
            end = '\n'
        if file is None:
            file = sys.stdout
        file.write(sep.join(map(str, objects)) + end)
        if flush:
            file.flush()

As the example code above shows, the print function has an obvious advantage over the print statement: we can now specify a different separator (sep) and terminator (end).

Since our topic is encoding, we will not go further into the benefits of the print function here.

So, whether in 2 or 3, to understand print we need to be clear about one function: str()
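
A short Python 3 usage sketch of the sep, end and file parameters. Output is written to an in-memory io.StringIO buffer here only so that the result can be inspected.

```python
import io

buffer = io.StringIO()
print("alvin", "yuan", sep="-", end="!\n", file=buffer)
print(1, 2, 3, file=buffer)   # defaults: sep=' ', end='\n'

output = buffer.getvalue()
print(repr(output))  # 'alvin-yuan!\n1 2 3\n'
```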

Py2: str()

    # class str(object='')
    #
    # Return a string containing a nicely printable representation of an
    # object. For strings, this returns the string itself. The difference
    # with repr(object) is that str(object) does not always attempt to
    # return a string that is acceptable to eval(); its goal is to return
    # a printable string. If no argument is given, returns the empty
    # string, ''.
    #
    # For more information on strings see Sequence Types - str, unicode,
    # list, tuple, bytearray, buffer, xrange which describes sequence
    # functionality (strings are sequences), and also the string-specific
    # methods described in the String Methods section. To output formatted
    # strings use template strings or the % operator described in the
    # String Formatting Operations section. In addition see the String
    # Services section. See also unicode().

Py3: str()

    # class str(object=b'', encoding='utf-8', errors='strict')
    #
    # Return a string version of object. If object is not provided, returns
    # the empty string. Otherwise, the behavior of str() depends on whether
    # encoding or errors is given, as follows.
    #
    # If neither encoding nor errors is given, str(object) returns
    # object.__str__(), which is the "informal" or nicely printable string
    # representation of object. For string objects, this is the string
    # itself. If object does not have a __str__() method, then str() falls
    # back to returning repr(object).
    #
    # If at least one of encoding or errors is given, object should be a
    # bytes-like object (e.g. bytes or bytearray). In this case, if object
    # is a bytes (or bytearray) object, then str(bytes, encoding, errors)
    # is equivalent to bytes.decode(encoding, errors). Otherwise, the bytes
    # object underlying the buffer object is obtained before calling
    # bytes.decode(). See Binary Sequence Types - bytes, bytearray,
    # memoryview and Buffer Protocol for information on buffer objects.
    #
    # Passing a bytes object to str() without the encoding or errors
    # arguments falls under the first case of returning the informal string
    # representation (see also the -b command-line option to Python).
    # For example:
    #
    # >>> str(b'Zoot!')
    # "b'Zoot!'"
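
The two cases described in the Py3 documentation can be compared side by side. A minimal Python 3 sketch with illustrative variable names:

```python
raw = "苑".encode("utf-8")   # b'\xe8\x8b\x91'

# Without encoding/errors: the informal representation of the bytes object.
informal = str(raw)
print(informal)  # b'\xe8\x8b\x91'

# With an encoding: equivalent to raw.decode('utf-8').
decoded = str(raw, encoding="utf-8")
print(decoded == raw.decode("utf-8"))  # True
print(len(informal), len(decoded))     # 15 1
```

The first call does not decode anything: it returns a 15-character description of the bytes object, which is why passing bytes to str() without an encoding is rarely what you want.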

Common encoding errors

1 Garbled output under cmd

hello.py

    #coding:utf8
    print('苑昊')

The file is also saved as UTF-8.

Think about it: why does the script print correctly in the IDE under both 2 and 3, while under cmd.exe 3 is correct but 2 is garbled?

On Windows we run the script in the terminal, cmd.exe. Note that cmd.exe is itself a piece of software. When we run python2 hello.py, the Python 2 interpreter (default encoding ASCII) reads the file as UTF-8 as declared, and the file really was saved as UTF-8, so there is no problem at this stage. When print('苑昊') executes, the interpreter runs normally and raises no error; it simply hands the content to cmd.exe for display. In Py2 that content is a UTF-8 encoded byte string, while cmd.exe decodes with its own default encoding, GBK. Decoding UTF-8 bytes as GBK naturally produces garbled text.

Py3 prints correctly because the string is Unicode text: Python converts it for the terminal itself instead of passing UTF-8 bytes straight through, so cmd.exe receives something it can display.

    print(u'苑昊')

After changing the code to this, cmd no longer has a problem under 2 either.
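
The garbling can be reproduced without a terminal by decoding by hand. The Python 3 sketch below assumes a terminal that decodes with GBK, as cmd.exe does on a Chinese-language Windows system; the variable names are illustrative.

```python
text = "苑昊"

# A Py2 print of a str literal from a UTF-8 source file sends these bytes.
sent_bytes = text.encode("utf-8")

# A GBK terminal decodes whatever bytes it receives as GBK.
shown_wrong = sent_bytes.decode("gbk")
print(shown_wrong == text, len(shown_wrong))  # False 3

# When the text is Unicode, Python encodes it for the terminal itself,
# so the terminal's decoding matches.
shown_right = text.encode("gbk").decode("gbk")
print(shown_right == text)  # True
```

Six UTF-8 bytes are read as three two-byte GBK characters, which is why two Chinese characters turn into three unrelated ones.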

2 Print issues

In Py2:

    #coding:utf8
    print('苑昊')              # 苑昊
    print(['苑昊', 'yuan'])    # ['\xe8\x8b\x91\xe6\x98\x8a', 'yuan']

In Py3:

    print('苑昊')              # 苑昊
    print(['苑昊', 'yuan'])    # ['苑昊', 'yuan']
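
The difference comes from how a list is converted to text: the list's string form is built from the repr() of each element, not from its str(). A Python 3 sketch, where the bytes objects stand in for Py2 byte strings:

```python
items = ["苑昊", "yuan"]

# str() of a list is built from the repr() of each element.
as_printed = str(items)
from_reprs = "[" + ", ".join(repr(item) for item in items) + "]"
print(as_printed == from_reprs)  # True

# With byte strings (the Py2 str analogue), repr() shows escaped bytes.
byte_items = [item.encode("utf-8") for item in items]
print(byte_items)  # [b'\xe8\x8b\x91\xe6\x98\x8a', b'yuan']
```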

Reprinted from Teacher Yuan's blog: http://www.cnblogs.com/yuanchenqi/articles/5938733.html
