Python 2 Encoding Summary

Source: Internet
Author: User

The following lists several frequently encountered Python 2 encoding problems, with explanations.

# -*- coding: utf-8 -*-

Python 2 assumes ASCII as the source file encoding by default. In practice, however, we use a lot of Chinese in our code. To keep programs that contain Chinese from raising errors, and to follow international conventions, we generally set the file encoding to UTF-8.

The declaration can be written in many forms: any comment on the first or second line that matches the regular expression "coding[:=]\s*([-\w.]+)" works. The usual form is # -*- coding: utf-8 -*-.
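To see what the pattern quoted above accepts, here is a minimal sketch using the standard re module. The helper name declared_encoding is hypothetical and only for illustration; the real interpreter applies additional rules (for example, the declaration must be in a comment on line 1 or 2).

```python
from __future__ import print_function
import re

# The pattern quoted in the text above.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(line):
    """Return the encoding named on a line, or None if the pattern does not match.

    Hypothetical helper, for illustration only.
    """
    match = CODING_RE.search(line)
    return match.group(1) if match else None


print(declared_encoding("# -*- coding: utf-8 -*-"))        # utf-8
print(declared_encoding("# vim: set fileencoding=gbk :"))  # gbk
print(declared_encoding("import sys"))                     # None
```

The second call shows why editors' own spellings also work: "fileencoding=gbk" contains "coding=" followed by an encoding name, so it matches the same pattern.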

s = "你好"
print s

Running the code above produces the error: SyntaxError: Non-ASCII character '\xe4' in file d:/testpython/test/111.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details. The message says that the program contains non-ASCII characters but declares no encoding. Once the UTF-8 declaration is added, the error goes away.

# -*- coding: utf-8 -*-
s = "你好"
print s

This version no longer raises an error, but the output is garbled. Why? That is what we look at next.

Encode and Decode

Before explaining encoding and decoding, we first need to talk about the relationship between Unicode and UTF-8; I recommend this blog post to everyone.

A string is made up of characters, and the computer stores each character in binary form; that binary form is the encoding. If strings were handled directly along the path "string ↔ characters ↔ binary representation (encoding)", converting between different encodings would be more complex, because every pair of encodings would need its own conversion. So an abstraction layer is introduced: "string ↔ characters ↔ storage-independent representation ↔ binary representation (encoding)". A character can then be represented in a storage-independent form, and converting between two encodings means converting to this abstraction layer first and then to the other encoding. Here, Unicode is the "storage-independent representation" and UTF-8 is a "binary representation".

Python 2 has two string types, str and unicode. str can be understood as the binary encoded form described above, and unicode as the abstraction layer. encode means encoding, that is, converting from unicode to a binary encoding such as UTF-8 or GB2312. decode means decoding, that is, converting from a binary encoding to unicode. See the code below:

# -*- coding: utf-8 -*-

str1 = "你好"
print type(str1)
str2 = str1.decode("utf-8")
print type(str2)

str1 is of type str, and decode converts it to the unicode type.

Now look at the encode version:

# -*- coding: utf-8 -*-
str1 = u"你好"
print type(str1)
str2 = str1.encode("utf-8")
print type(str2)

str1 is of type unicode, and encode converts it to str.
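The role of unicode as the abstraction layer can be sketched as a round trip between two encodings. This is an illustration rather than code from the original article: it is written to run on both Python 2.7 and Python 3, and it uses \u escapes for the two characters of "你好" so that no encoding declaration is needed.

```python
from __future__ import print_function

# The two characters of "你好" as abstract code points (unicode).
text = u"\u4f60\u597d"

# unicode -> bytes: the same text has different binary forms per encoding.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
print(len(text), len(utf8_bytes), len(gbk_bytes))  # 2 6 4

# bytes -> unicode: decoding each form gives back the same abstract text.
print(utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk"))  # True

# GBK -> UTF-8 conversion goes through unicode, never directly.
converted = gbk_bytes.decode("gbk").encode("utf-8")
print(converted == utf8_bytes)  # True
```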

Now back to the question left open at the beginning: why was the output garbled? The encoding declared for the file is UTF-8, so the str holds UTF-8 bytes, but print writes them to the console, and the console here cannot display UTF-8 encoded bytes. So we convert the string to unicode before printing:

# -*- coding: utf-8 -*-
s = "你好"
s = s.decode("utf-8")
print s

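Often encode and decode need the 'ignore' argument in order to convert without raising an error, for example .encode('utf-8', 'ignore') or .decode('utf-8', 'ignore'). With 'ignore', anything that cannot be converted is skipped instead of raising an exception. Why this is needed is left for you to think about.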
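One situation where 'ignore' matters is data cut off in the middle of a multi-byte character. The following is a small sketch under that assumption (the truncated input is made up for illustration); it runs on both Python 2.7 and Python 3.

```python
from __future__ import print_function

data = u"\u4f60\u597d".encode("utf-8")  # "你好" as 6 UTF-8 bytes
truncated = data[:5]                    # cut inside the second character

# Strict decoding (the default) raises on the incomplete character.
reason = None
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
    print("strict decoding failed:", reason)

# With 'ignore' the incomplete bytes are dropped silently.
recovered = truncated.decode("utf-8", "ignore")
print(recovered == u"\u4f60")  # True: only the first character survives

# The same option on encode drops characters the target cannot represent.
ascii_only = u"\u4f60\u597d abc".encode("ascii", "ignore")
print(ascii_only == b" abc")  # True
```

Note that 'ignore' loses data without any warning, so it is best kept for cases where dropping a few characters is acceptable.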

Detecting the encoding with chardet

Sometimes we cannot know how a string is encoded. Crawled web pages, for example, may be UTF-8 or GB2312. In that case we can detect the encoding first and then convert to unicode. The third-party library chardet does this. Usage is roughly as follows:

# -*- coding: utf-8 -*-
import chardet

raw = "xxxxx"
str_type = chardet.detect(raw)
code = str_type['encoding']

code is the detected encoding of the input string. Some people report, however, that the detected encoding is inaccurate and that detection is slow. In my tests the speed was indeed mediocre, but I have not yet seen an inaccurate result. Use it if it suits you; I am only offering one approach. If anyone knows a better way, please let me know.
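When the set of possible encodings is small and known in advance, one alternative that avoids the extra dependency is simply to try each candidate in order. This is not chardet and not part of the original article, only a sketch; the function name decode_with_candidates and the default candidate list are assumptions for illustration.

```python
from __future__ import print_function


def decode_with_candidates(raw, candidates=("utf-8", "gb18030")):
    """Try each candidate encoding in order.

    Returns (text, encoding) for the first candidate that decodes without
    error. Hypothetical helper; the candidate list is an assumption.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


utf8_page = u"\u4e2d\u6587".encode("utf-8")   # "中文" as UTF-8 bytes
gbk_page = u"\u4e2d\u6587".encode("gb2312")   # "中文" as GB2312 bytes

print(decode_with_candidates(utf8_page)[1])  # utf-8
print(decode_with_candidates(gbk_page)[1])   # gb18030
```

The order matters: UTF-8 is strict and rejects most non-UTF-8 data, while GB18030 accepts many byte sequences, so the strict encoding should be tried first. The result is still a guess, just like any detection method.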

import sys

reload(sys)

sys.setdefaultencoding('utf8')

I used to run into baffling encoding errors, searched online, found that this method fixed them, and used it without understanding the principle behind it. Today I came across a good blog post that I recommend to everyone: http://blog.csdn.net/crazyhacking/article/details/39375535. The following is quoted from that article:

Encoding and decoding in Python is the conversion between unicode and str. Encoding is unicode -> str; decoding is the reverse, str -> unicode. The remaining question is deciding when to encode or decode.

First, the "encoding declaration" at the beginning of the file, that is, the # -*- coding: -*- statement. Python script files are ASCII encoded by default, and the encoding declaration is used to correct this when the file contains characters outside the ASCII range.

Second, sys.defaultencoding (the value returned by sys.getdefaultencoding()). This is used when a decode happens without the encoding being stated explicitly. For example, I have the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'  # note that s here is of type str, not unicode
s.encode('GB18030')

This code re-encodes s into GB18030, which is a unicode -> str conversion. Because s itself is of type str, Python automatically decodes s to unicode first and then encodes it into GB18030. Because the decoding is done automatically by Python and we did not state the encoding, Python decodes using the encoding indicated by sys.defaultencoding. In many cases sys.defaultencoding is ASCII, and if s is not ASCII encoded, an error occurs. In the case above, my sys.defaultencoding is ASCII, while the encoding of s matches the file encoding, which is UTF-8, so the error is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

There are two ways to correct this error. The first is to state the encoding of s explicitly:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('GB18030')

The second is to change sys.defaultencoding to the encoding of the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)  # Python 2.5 removes sys.setdefaultencoding after initialization, so we need to reload sys
sys.setdefaultencoding('utf-8')
str = '中文'
str.encode('gb18030')

After reading this, I changed my code to print "<P>ADDR:", form["addr"].value.decode('gb2312').encode('utf-8') and it passed.
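The implicit step described in the quoted passage can be made visible by writing it out by hand. The sketch below is an illustration, not code from the quoted article: it spells out the ASCII decode that Python 2 performs automatically, so it also runs on Python 3, where that implicit step no longer exists.

```python
from __future__ import print_function

# UTF-8 bytes, like s = '中文' in a Python 2 file declared as UTF-8.
s = u"\u4e2d\u6587".encode("utf-8")

# What Python 2 effectively does for s.encode('GB18030') when the default
# encoding is ASCII: decode with ASCII first, then encode.
error_message = None
try:
    s.decode("ascii").encode("gb18030")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# The first fix from the text: state the real encoding of s explicitly.
gb_bytes = s.decode("utf-8").encode("gb18030")
print(gb_bytes.decode("gb18030") == u"\u4e2d\u6587")  # True
```

Writing the decode explicitly keeps the fix local to the line that needs it, whereas changing the default encoding affects every implicit conversion in the process.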

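Still, changing the default encoding is an awkward fix. It is better to control the encoding in your own code as far as possible and to be explicit about each encoding format, although the approach above is practical as well.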

Personal Summary

In actual programming, it is best to use a single representation inside the code, such as unicode, because then encoding issues do not need to be considered during processing. Convert to a storage encoding (UTF-8, GBK) only when displaying or outputting.
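A minimal sketch of this habit: decode at the point where bytes come in, work only on unicode, and encode at the point where bytes go out. The helper names read_text and write_text and the sample data are hypothetical, chosen only for illustration; the code runs on both Python 2.7 and Python 3.

```python
from __future__ import print_function


def read_text(raw, encoding):
    """Input boundary: bytes in a known encoding -> unicode."""
    return raw.decode(encoding)


def write_text(text, encoding):
    """Output boundary: unicode -> bytes in the target encoding."""
    return text.encode(encoding)


# Assume the input arrives as GBK bytes (for example from a legacy file).
incoming = u"\u4f60\u597d".encode("gbk")

text = read_text(incoming, "gbk")      # decode once, on the way in
greeting = text + u", Python"          # all processing happens on unicode
outgoing = write_text(greeting, "utf-8")  # encode once, on the way out

print(len(greeting), len(outgoing))  # 10 14
```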

These are some problems I ran into recently while developing Python code, together with a summary. If anything here is wrong, please reply so we can discuss it. Thank you.
