Python Encoding Problems: A Summary (Repost)

Source: Internet
Author: User

Understanding common encodings

  GB2312 is a Chinese national standard character encoding for Simplified Chinese characters.

  GBK is an extension of GB2312. It stays compatible with GB2312 and also covers Traditional Chinese characters as well as Japanese kana.

  cp936: on a Chinese-locale Windows system, cmd uses code page 936 (cp936) by default. cp936 is the system's code page number 936, which is Microsoft's implementation of GBK (and therefore a superset of GB2312).

(Of course there are other encoding formats: cp950 Traditional Chinese, cp932 Japanese, cp1250 Central European language ... )
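The relationship between these names can be checked from Python itself. The following is a minimal sketch, assuming Python 3 and its standard codecs; the sample character (the Traditional Chinese character U+9AD4) is only an illustration chosen here, not something from the original article.

```python
import codecs

# Ask Python which codec is registered under each name.
for name in ("gb2312", "gbk", "cp936"):
    print(name, "->", codecs.lookup(name).name)

# A Traditional Chinese character: representable in GBK/cp936,
# but not in the Simplified-only GB2312 set.
ch = "\u9ad4"
cp936_bytes = ch.encode("cp936")

try:
    ch.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

print(len(cp936_bytes), in_gb2312)
```

Python treats cp936 as an alias of its gbk codec, which is why the two names behave identically here.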

  Unicode is a character set developed by international organizations to cover all of the world's scripts and symbols; it assigns each character a number (a code point). UTF-8, UTF-16, and UTF-32 are encoding schemes that convert those code point numbers into bytes that programs can store and transmit.

  UTF-8 (8-bit Unicode Transformation Format) is one of the most widely used encodings for transmitting and storing Unicode. It uses a variable number of bytes, from one to four, for each code point. ASCII characters need only one byte each, identical to their ASCII encoding, so ASCII is a subset of UTF-8.
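The variable width is easy to observe. This is a small sketch, assuming Python 3; the four sample characters are chosen here for illustration only.

```python
# One sample character for each UTF-8 length: 1, 2, 3 and 4 bytes.
samples = ["A", "\u00a9", "\u4e2d", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)

# For ASCII characters, the UTF-8 bytes and the ASCII bytes are identical.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")
print(same_as_ascii)
```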

When developing a Python program, encoding comes up in three places:

    • The encoding of the Python program file itself
    • The encoding of the Python program's runtime environment (IDE or console)
    • The encoding of external files and web pages that the program reads

Encoding of Python program files

For example:

In the IDE that ships with Python 2 (IDLE), a prompt pops up when you save a newly created file that contains Chinese.

This happens because the Python 2 editor defaults to ASCII, which cannot represent Chinese characters, so the prompt appears. It is also why, in most cases, we add this as the first line of a Python 2 program: # coding=utf-8
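How the interpreter reads that first-line declaration can be explored with the standard library. The sketch below assumes Python 3, where tokenize.detect_encoding applies the same declaration rule but falls back to UTF-8 (not ASCII, as Python 2 does) when no declaration is found; declared_encoding is a helper name made up for this example.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3 would pick for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A plain declaration and the Emacs-style variant are both recognized.
print(declared_encoding(b"# coding=gbk\nx = 1\n"))
print(declared_encoding(b"# -*- coding: gbk -*-\nx = 1\n"))

# With a space before "=", the line is not recognized as a declaration,
# so the default applies.
print(declared_encoding(b"# coding =gbk\nx = 1\n"))
```

The declaration must have ":" or "=" directly after the word "coding", which is why the comment is best written without a space before the equals sign.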

In fact, the encoding of the source file itself is very easy to deal with.

Encoding of the Python Program Runtime Environment (IDE)

Run the following program.

    # coding=utf-8
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://www.baidu.com")

    # Get the record information at the bottom of the Baidu page
    text = driver.find_element_by_id("cp").text
    print(text)
    driver.close()

Run under Windows cmd, the program fails with an encoding error.

The text we want to retrieve is:

    ©2015 Baidu  Read Before Using Baidu  Feedback  Beijing ICP License No. 030173

Windows cmd uses cp936, that is, the Chinese GBK encoding. The GBK character set has no "©" character, so encoding the output as GBK fails.
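The failure can be reproduced without a browser or a Chinese console. This is a minimal sketch, assuming Python 3; the sample string is a shortened stand-in for the page text.

```python
text = "\u00a92015 Baidu"  # starts with the copyright sign

try:
    text.encode("gbk")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception records which codec failed and where in the string.
print(error.encoding, error.start, repr(text[error.start:error.end]))

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```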

It is like translating English and coming across a word for which the Oxford dictionary has no entry: naturally, the translation runs into trouble.

If you still want to run this Python program under cmd, you can change cmd's active code page to UTF-8, which is code page 65001. In cmd, type chcp 65001 and press Enter.

Then change the cmd font to "Lucida Console" and run the program again; the output is now displayed correctly.
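Why switching the code page helps can be illustrated by simulating the console with an in-memory stream. The sketch below assumes Python 3; write_like_console is a hypothetical helper written for this example, and a real console involves more than just the stream encoding.

```python
import io


def write_like_console(text, encoding):
    """Write text through a text stream that uses the given encoding."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)   # the text is encoded here, as print() would do
    stream.flush()
    stream.detach()      # keep the underlying buffer open
    return raw.getvalue()


text = "\u00a92015 Baidu"

# A GBK stream (like cmd with code page 936) rejects the copyright sign.
try:
    write_like_console(text, "gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# A UTF-8 stream (like cmd after chcp 65001) accepts it.
utf8_bytes = write_like_console(text, "utf-8")
print(gbk_ok, utf8_bytes)
```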

Encoding of external files and web pages read by a Python program

# No suitable example has been found for this part yet.
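Until a real-world case turns up, the basic idea can be sketched as follows: bytes read from a file (or a web response) must be decoded with the encoding they were written in. This sketch assumes Python 3; the file name and the sample text are made up for illustration.

```python
import io
import os
import tempfile

text = "\u4e2d\u6587"  # the two characters of the Chinese word for "Chinese"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_gbk.txt")  # hypothetical file name

    # Create a GBK-encoded file to stand in for an external file.
    with io.open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Reading in binary mode gives the raw bytes.
    with io.open(path, "rb") as f:
        raw = f.read()

    # Reading with the matching encoding gives the original text back.
    with io.open(path, "r", encoding="gbk") as f:
        decoded = f.read()

# Decoding the same bytes with the wrong encoding fails.
try:
    raw.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

print(raw, decoded == text, wrong_guess_failed)
```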

View Python system encoding

Check the default system encoding in Python 2 and in Python 3.

Python2:

    Python 2.7.10 (default, May, 09:40:32) [MSC v.1500 32 bit (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

Python3:

    Python 3.5.0 (v3.5.0:374f501f4567, Sep, 02:27:37) [MSC v.1900 64 bit (AMD64)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'utf-8'

So how do you change the Python 2 system encoding to UTF-8?

    # Python 2 only
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

So if you run into an error message like the following while your program is running:

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1 ....

you can add the three lines of code above to the top of your Python program.
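The same codec error can be triggered on purpose, which makes its cause easier to see. This is a sketch assuming Python 3, where the conversion has to be requested explicitly; in Python 2 it typically happens implicitly through the ASCII default. It also shows the alternative of naming the target encoding at the point of conversion.

```python
text = "\u4e2d\u6587"  # two Chinese characters, positions 0 and 1

# ASCII has no way to represent these characters.
try:
    text.encode("ascii")
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

print(message)

# Naming a capable encoding explicitly avoids the error.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```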

decode() and encode()

    • decode() converts a string in some other encoding to Unicode. For example, name.decode("GB2312") converts the GB2312-encoded string name to Unicode.
    • encode() converts a Unicode string to some other encoding. For example, name.encode("GB2312") converts the Unicode string name to GB2312.
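The two directions can be seen in a short round trip. This sketch assumes Python 3, where decode() lives on bytes objects and encode() on str objects; the byte values are the GB2312 encoding of a two-character Chinese word used here as sample data.

```python
gb_bytes = b"\xd6\xd0\xce\xc4"       # GB2312-encoded sample bytes

# decode(): encoded bytes -> Unicode text
name = gb_bytes.decode("GB2312")

# encode(): Unicode text -> bytes in a chosen encoding
utf8_bytes = name.encode("utf-8")
round_trip = name.encode("GB2312")

print(len(name), utf8_bytes, round_trip == gb_bytes)
```

The same text takes four bytes in GB2312 and six in UTF-8, which is why the bytes must always travel together with the name of their encoding.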

Take the earlier example of fetching the text at the bottom of the Baidu page. It can also be solved with encode() and decode():

    # coding=utf-8
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("http://www.baidu.com")

    # Get the record information at the bottom of the Baidu page
    text = driver.find_element_by_id("cp").text
    text2 = text.encode("GBK", "ignore").decode("GBK")
    print(text2)

Here encode() converts the Unicode text to GBK, and "ignore" drops the character that GBK cannot represent (©) during the conversion; decode() then converts the GBK bytes back to Unicode. Of course, this is not a perfect solution, since part of the string is sacrificed.
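The effect of the error handler can be checked on a plain string, without Selenium. This is a sketch assuming Python 3; the "replace" handler is shown only as a possible alternative to "ignore" and is not part of the original example.

```python
text = "\u00a92015 Baidu"

# "ignore" silently drops characters that GBK cannot represent.
ignored = text.encode("GBK", "ignore").decode("GBK")

# "replace" substitutes a question mark, so the loss stays visible.
replaced = text.encode("GBK", "replace").decode("GBK")

print(repr(ignored))
print(repr(replaced))
```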

chardet module

chardet is a very handy encoding detection module.

Install via pip:

>pip install chardet

Usage:

    >>> from chardet import detect
    >>> a = "中文"
    >>> detect(a)
    {'confidence': 0.682639754276994, 'encoding': 'KOI8-R'}

chardet reports a confidence of about 68% that the encoding is KOI8-R.
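Detection is statistical, so a two-character sample gives it very little to work with. The sketch below assumes Python 3, where chardet.detect expects bytes, and compares a short and a longer sample; guess_encoding is a hypothetical wrapper written for this example, the sample text is made up, and the import is guarded so the code still loads where chardet is not installed. No output is shown because the guesses depend on the chardet version.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the sketch then simply returns None


def guess_encoding(data):
    """Return chardet's guess for a bytes object, or None if chardet is missing."""
    if chardet is None:
        return None
    return chardet.detect(data)


# A very short sample and a longer one, both UTF-8 encoded Chinese text.
short_sample = "\u4e2d\u6587".encode("utf-8")
long_sample = ("\u4e2d\u6587\u7f16\u7801\u95ee\u9898\u6574\u7406\uff0c" * 10).encode("utf-8")

for label, data in (("short", short_sample), ("long", long_sample)):
    print(label, guess_encoding(data))
```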
