Guide to Python encoding pitfalls, Python encoding

Last Update:2016-06-10 Source: Internet

Author: User


I have been learning Python recently. It is a short and concise language, and I like that it is backed by powerful libraries. However, I ran into an encoding headache shortly after my first contact with it. I looked up a lot of information on the Internet and have summarized it here. It serves as a record for myself, and I would be honored if it saves those who come after me a few detours.

Let's first describe the phenomenon:

import os
for i in os.listdir("E:\Torchlight II"):
    print i

The code is simple. We use the listdir function of the os module to traverse the directory E:\Torchlight II (Torchlight?! :)). Because some files in this directory have Chinese names, garbled characters appear in the printed output.

So where is the problem? Don't worry. Let's analyze it step by step.

From two discussions found online, we can be almost certain that the problem is:

This means that the python console app can't write the given character to the console's encoding. More specifically, the python console app created a _io.TextIOWrapper instance with an encoding that cannot represent the given character.

sys.stdout --> _io.TextIOWrapper --> (your console)
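The failure described in the quote can be reproduced without a real console. The following is a minimal sketch, assuming Python 3 (the article's own snippets are Python 2); the helper name try_write and the sample string are illustrative only. It writes text through a TextIOWrapper and reports whether the chosen encoding can represent it.

```python
import io

def try_write(text, encoding):
    """Write text through a TextIOWrapper that uses the given encoding.

    Returns the encoded bytes, or None if the encoding cannot
    represent the text (UnicodeEncodeError).
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return None
    return buffer.getvalue()

# Illustrative sample: the two characters of the word "Chinese" written in Chinese.
sample = u"\u4e2d\u6587"

ascii_result = try_write(sample, "ascii")  # ASCII cannot represent these characters
gbk_result = try_write(sample, "gbk")      # GBK can
```

Here ascii_result is None because the ASCII codec rejects the characters, while gbk_result holds the GBK-encoded bytes.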

I don't know whether you had the same thought I did: can we simply set the console encoding to one that can represent Chinese characters? Wait, let's spend a little more time on Google:

Python determines the encoding of stdout and stderr based on the value of the LC_CTYPE variable, but only if the stdout is a tty. So if I just output to the terminal, LC_CTYPE (or LC_ALL) define the encoding. However, when the output is piped to a file or to a different process, the encoding is not defined, and defaults to 7-bit ASCII.

The details are as follows:

1). When Python finds its output attached to a terminal, it sets the sys.stdout.encoding attribute to the terminal's encoding. The print statement's handler will automatically encode unicode arguments into str output.

2). When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the "ascii" codec.
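To see which of the two cases applies in a given environment, it may help to inspect the stream directly. This is a small sketch assuming Python 3; describe_stream is a hypothetical helper, not part of any library. Note that the None/"ascii" fallback quoted above is Python 2 behavior; Python 3 normally reports an encoding even when the output is piped.

```python
import sys

def describe_stream(stream):
    """Return (is_tty, encoding) for a file-like object.

    Both values are looked up defensively because replacement
    streams do not always provide isatty() or an encoding attribute.
    """
    isatty = getattr(stream, "isatty", None)
    is_tty = bool(isatty()) if callable(isatty) else False
    encoding = getattr(stream, "encoding", None)
    return is_tty, encoding

is_tty, encoding = describe_stream(sys.stdout)
print("stdout is a tty: %s, encoding: %s" % (is_tty, encoding))
```

Running the script once directly in a terminal and once with its output redirected to a file shows how the reported values differ between the two cases.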

Well, it seems that the idea just now is not very elegant, because we would have to modify system settings. In fact, the discussion above is based on a Linux environment. On Linux, we may need to change the value of an environment variable (LC_CTYPE or LANG); on Windows, the console encoding is tied to the operating system's regional settings. For example, in Chinese Windows 7, the default console encoding is GBK (cp936). You can try the following code:

import locale
print locale.getdefaultlocale()[1]

If the console encoding cannot easily be changed, can we set sys.stdout.encoding directly to achieve our goal? Unfortunately, the answer is no: the attribute is read-only.
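The read-only behavior can be checked on a stand-in stream. The sketch below assumes Python 3 and uses an in-memory TextIOWrapper rather than the real sys.stdout, so the names raw, stream and rewrapped are illustrative. It also shows one possible workaround, which the article itself does not use: wrapping the same underlying buffer again with a different encoding.

```python
import io

raw = io.BytesIO()  # stands in for the console's byte stream
stream = io.TextIOWrapper(raw, encoding="ascii")

# Assigning to the encoding attribute is rejected.
try:
    stream.encoding = "gbk"
    assignment_failed = False
except (AttributeError, TypeError):
    assignment_failed = True

# Possible workaround: detach the buffer and wrap it with another encoding.
rewrapped = io.TextIOWrapper(stream.detach(), encoding="gbk")
rewrapped.write(u"\u4e2d")  # illustrative character
rewrapped.flush()
```

After detach(), the original stream object can no longer be used; only rewrapped writes to the buffer.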

So is there no way out? Not quite. We are actually very close to success. Let's sort through the information found above and see what we know so far:

1). The console cannot display Chinese characters correctly. The console encoding is determined by the operating system (on Windows);

2). My operating system is the Chinese version of Windows 7 (GBK), as reported by locale.getdefaultlocale()[1];

3). The console encoding determines the value of sys.stdout.encoding; in my environment, sys.stdout.encoding is UTF-8;

4). The strings returned when the operating system enumerates the directory (E:\Torchlight II) are also GBK-encoded.
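The mismatch summarized in the list above can be reproduced in a few lines. This is a minimal sketch assuming Python 3, with an illustrative sample string: text is encoded as GBK and then interpreted as UTF-8, which is what a UTF-8 output side effectively does with GBK bytes.

```python
# Illustrative sample: the two characters of the word "Chinese" written in Chinese.
original = u"\u4e2d\u6587"

# What the operating system hands back on a GBK system: GBK-encoded bytes.
gbk_bytes = original.encode("gbk")

# Interpreting those bytes as UTF-8 yields replacement characters (mojibake).
wrong = gbk_bytes.decode("utf-8", errors="replace")

# Interpreting them with the encoding they were written in recovers the text.
right = gbk_bytes.decode("gbk")
```

The value of wrong consists of U+FFFD replacement characters, which many consoles render as question marks, while right equals the original text.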

Do you see the problem? The strange question marks in the output at the top appear because the strings are GBK-encoded, while sys.stdout.encoding is UTF-8. The GBK bytes are therefore interpreted as UTF-8 on output, which is of course wrong. Now that the cause is clear, modify the code:

import os
for i in os.listdir("E:\Torchlight II"):
    print i.decode('gbk')

In the code, we explicitly tell Python to decode the strings it read as GBK. After this step, the data is a proper unicode object, and print outputs it correctly (even though sys.stdout.encoding is UTF-8).
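The same "decode with the encoding the bytes were written in" step can be wrapped in a small helper. The sketch below assumes Python 3, where os.listdir already returns text for a text path, so decoding is only needed when names arrive as bytes. The function to_text and the sample entries are hypothetical.

```python
def to_text(name, encoding="gbk"):
    """Return name as text, decoding it only if it is a byte string.

    The default encoding matches the article's GBK system; pass a
    different one if the bytes come from somewhere else.
    """
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name

# Hypothetical directory entries: one GBK-encoded byte string, one text string.
raw_names = [b"\xd6\xd0\xce\xc4.txt", u"readme.txt"]
names = [to_text(n) for n in raw_names]
```

Once every name is text, printing is left to the output stream, which encodes it with whatever encoding the console uses.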

P.S.

A Google search also turns up many similar encoding problems. Although they look different from one another and come with a variety of solutions, some of them specific to Python itself, they all come down to the encoding and decoding of characters. Once that is understood, all of them can be solved.

Here are several valuable references:

Unicode HOWTO: http://docs.python.org/howto/unicode.html#history-of-character-codes

Unicode In Python, Completely Demystified: http://farmdev.com/talks/unicode/

Python, Unicode and UnicodeDecodeError: http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets: http://www.joelonsoftware.com/articles/Unicode.html

Source: http://www.cnblogs.com/pinopino/archive/2012/10/04/2711347.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.