Python 2.7 Encoding Problems: A Summary

Source: Internet
Author: User

This article is reposted from: http://www.cnblogs.com/fnng/p/5008884.html.

I must say, the articles on Insect Master's blog are well worth consulting; reading this one from start to finish resolved a lot of my questions.

Since I like to collect good material, I am reposting it here. As I gain experience of my own, I can also revise it directly.

-------------------------------------------------------------------------------------------

0. Getting to know common encodings

  GB2312 is China's national standard character set for Chinese characters; it can also be described as the Simplified Chinese character encoding.

  GBK is an extension of GB2312. Besides being compatible with GB2312, it can also represent Traditional Chinese characters and Japanese kana.

  cp936: on a Chinese-language Windows system, the default code page of cmd is cp936. cp936 means code page 936 of the system, which is Microsoft's implementation of GBK (and therefore also covers GB2312).

(Of course there are other code pages: cp950 for Traditional Chinese, cp932 for Japanese, cp1250 for Central European languages, and so on.)

  Unicode is a character encoding standard developed by international organizations, designed to accommodate all the world's scripts and symbols. It assigns each character a number (a code point). UTF-8, UTF-16, and UTF-32 are encoding schemes that convert those numbers into bytes a program can store.

  UTF-8 (8-bit Unicode Transformation Format) is one of the most popular encodings for transmitting and storing Unicode. It uses a variable number of bytes (one to four) to represent each code point. ASCII characters need only one byte each, identical to their ASCII encoding, so ASCII is a subset of UTF-8.
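As a rough illustration of the variable-length behaviour described above, the following sketch encodes a few sample characters as UTF-8 and counts the bytes. The sample characters are my own choice, not from the original article, and they are written as \u escapes so the snippet does not depend on the encoding of the source file. It is intended to run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sample characters (chosen for illustration only), written as \u escapes.
samples = [
    (u"A", "Latin letter A"),
    (u"\u00e9", "e with acute accent"),
    (u"\u4e2d", "CJK character U+4E2D"),
]


def utf8_length(ch):
    """Return how many bytes UTF-8 needs for the given character."""
    return len(ch.encode("utf-8"))


for ch, label in samples:
    print("%s: %d byte(s)" % (label, utf8_length(ch)))

# Pure-ASCII text yields identical bytes under ASCII and UTF-8.
assert u"hello".encode("ascii") == u"hello".encode("utf-8")
```

The three samples need 1, 2 and 3 bytes respectively, and the final assertion checks the "ASCII is a subset of UTF-8" claim for one ASCII string.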

While developing a Python program, encoding comes up in three places:

    • The encoding of the Python source file
    • The encoding of the Python runtime environment (IDE)
    • The encoding of external files and web pages that the Python program reads

Encoding of the Python source file

For example:

When you create a file in the IDE bundled with Python 2 and save it, a prompt appears:

This is because the default source encoding in Python 2 is ASCII, which cannot represent Chinese characters, so the prompt pops up. This is also why, in most Python 2 programs, we add the following as the first line: #coding=utf-8

In fact, the source file encoding problem is the easy one to solve.

Encoding of the Python runtime environment (IDE)

Run the following program.

    #coding=utf-8
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://www.baidu.com")

    # Get the record information at the bottom of the Baidu page
    text = driver.find_element_by_id("cp").text
    print(text)
    driver.close()

Run it under Windows cmd:

The information we want to obtain is:

Read before using Baidu   Feedback   Beijing ICP License No. 030173

Windows cmd uses cp936, that is, the Chinese GBK code page. The GBK character set does not contain one of the characters in this text (shown here as "?"), which causes an encoding error when the output is encoded as GBK.

It is like translating English: if you come across a word and the Oxford dictionary you consult has no entry for it, you are naturally going to have a problem.

Suppose I still want to run this Python program under cmd. I can change cmd's code page to UTF-8, whose code page number is 65001: under cmd, type chcp 65001 and press Enter.

Then change the cmd font to "Lucida Console" and run the program again; the output is now displayed correctly.
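Both the failure and the effect of switching to UTF-8 can be reproduced without a browser. The sketch below assumes the missing character is something like the copyright sign (U+00A9), which GBK has no mapping for; the sample string is hypothetical rather than the actual page text, and can_encode is a helper name made up for this example.

```python
# -*- coding: utf-8 -*-
# Hypothetical footer text; \u00a9 is the copyright sign (an assumption).
footer = u"\u00a92015 Baidu"


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(footer, "gbk"))    # the encoding behind code page 936
print(can_encode(footer, "utf-8"))  # the encoding behind code page 65001
```

The first check fails and the second succeeds, which matches the behaviour seen before and after running chcp 65001.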

Encoding of external files and web pages that the Python program reads

# For this part, I have not found a suitable example yet
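As one possible illustration (mine, not the original author's), the sketch below writes a small GBK-encoded file and reads it back with io.open, passing the encoding explicitly instead of relying on a default. The file name, the helper read_text and the sample contents are all made up for the example. It is intended to run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text(path, encoding):
    """Read a file and decode it to Unicode using an explicit encoding."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Create a throwaway GBK-encoded file (hypothetical sample data).
sample = u"\u4e2d\u6587"  # two Chinese characters, written as escapes
path = os.path.join(tempfile.mkdtemp(), "sample_gbk.txt")
with io.open(path, "wb") as f:
    f.write(sample.encode("gbk"))

text = read_text(path, "gbk")
print(text == sample)
```

The key point is that the bytes on disk only become correct text when they are decoded with the encoding they were written in.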

Viewing Python's system encoding

Check the system default encoding in Python 2 or Python 3.

Python2:

    Python 2.7.10 (default, May, 09:40:32) [MSC v.1500 ...
    Type "copyright", "credits" or "license()" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

Python3:

    Python 3.5.0 (v3.5.0:374f501f4567, Sep, 02:27:37) [MSC v.1900 ...
    Type "copyright", "credits" or "license()" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'utf-8'

So how do we change the Python 2 system encoding to utf-8?

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

So if you encounter the following error while your program is running:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1 ...

you can add the three lines above to the top of your Python program.

decode() and encode()

  • decode() converts a string in some other encoding into Unicode, e.g. name.decode("GB2312") means that the GB2312-encoded string name is converted to Unicode.
  • encode() converts Unicode into a string in some other encoding, e.g. name.encode("GB2312") means that the Unicode string name is converted to GB2312 encoding.
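The two directions can be seen in a short round trip. This is a minimal sketch with sample bytes of my own choosing (a two-character Chinese word encoded as GB2312); it is intended to run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Bytes of a two-character Chinese word encoded as GB2312 (sample data).
raw_gb2312 = b"\xd6\xd0\xce\xc4"

# decode(): GB2312 bytes -> Unicode text
name = raw_gb2312.decode("gb2312")

# encode(): Unicode text -> bytes in another encoding
raw_utf8 = name.encode("utf-8")

print(repr(name))
print(repr(raw_utf8))
print(name.encode("gb2312") == raw_gb2312)  # round trip restores the bytes
```

Unicode acts as the intermediate form: to convert between two byte encodings, decode from the first and then encode to the second.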

Take the earlier example of fetching the footer information from the Baidu page. It can also be solved with decode() and encode():

    #coding=utf-8
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("http://www.baidu.com")

    # Get the record information at the bottom of the Baidu page
    text = driver.find_element_by_id("cp").text
    text2 = text.encode("gbk", "ignore").decode("gbk")
    print(text2)

Here, encode() converts the Unicode text to GBK, with "ignore" skipping the character that GBK does not recognize (?) during the conversion, and decode() then converts the GBK bytes back to Unicode. Of course, this is not a perfect solution, since part of the string is sacrificed.
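To make the trade-off concrete, the sketch below applies the same encode/decode round trip to a made-up string, once with "ignore" and once with "replace" (another standard error handler, not used in the original). It assumes the unsupported character is the copyright sign, U+00A9; gbk_safe is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
# Hypothetical text with one character GBK lacks (copyright sign, U+00A9).
text = u"\u00a92015 Baidu"


def gbk_safe(text, errors="ignore"):
    """Drop or replace characters GBK cannot represent, returning Unicode."""
    return text.encode("gbk", errors).decode("gbk")


print(gbk_safe(text))             # offending character removed
print(gbk_safe(text, "replace"))  # offending character becomes "?"
```

With "ignore" the character silently disappears, while "replace" leaves a visible "?" marker where it used to be, which can make the loss easier to notice.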

The chardet module

chardet is a very good encoding detection module.

Install it via pip:

> pip install chardet

Usage:

    >>> from chardet import detect
    >>> a = "中文"
    >>> detect(a)
    {'confidence': ..., 'encoding': 'koi8-r'}

In other words, chardet estimates with roughly 68% confidence that the encoding is koi8-r.
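Detection generally has more to work with on longer input. The sketch below is only an illustration under assumptions: it calls chardet.detect() on a longer GBK-encoded sample (made-up text), and the import is guarded so the snippet still runs where chardet is not installed. guess_encoding is a hypothetical helper, and no particular detection result is claimed.

```python
# -*- coding: utf-8 -*-
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return chardet's guess for a byte string, or None if unavailable."""
    if chardet is None:
        return None
    return chardet.detect(data)


# Made-up sample: a short Chinese phrase repeated, encoded as GBK bytes.
sample = (u"\u4e2d\u6587\u7f16\u7801\u6d4b\u8bd5" * 20).encode("gbk")
print(guess_encoding(sample))
```

The returned dictionary contains at least the 'encoding' and 'confidence' keys, so a caller can decide whether the guess is trustworthy enough to use.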
