A guide to Python encoding pitfalls

Source: Internet
Author: User


I have been learning Python recently. It is a short and concise language, and I like languages that are backed by powerful libraries. However, I ran into an encoding headache shortly after I started. I searched through a lot of material online and summarize it here, both as a record for myself and, I hope, as a way to save those who come later a few detours.

Let's first describe the phenomenon:

import os
# Python 2: list the directory and print each entry name
for i in os.listdir(r"E:\Torchlight II"):
    print i

The code is simple: we use os.listdir to traverse the directory E:\Torchlight II (Torchlight?! :)). Because some files in this directory have Chinese names, garbled characters appear in the printed output.

So where is the problem? Don't worry, let's analyze it bit by bit.

From here and here, we can be almost certain that the problem is:

This means that the python console app can't write the given character to the console's encoding. More specifically, the python console app created a _io.TextIOWrapper instance with an encoding that cannot represent the given character.

sys.stdout --> _io.TextIOWrapper --> (your console)
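The failure described in the quote can be reproduced without a real console. The following is a minimal sketch, written for Python 3 (the article's own snippets use Python 2). An in-memory io.BytesIO buffer stands in for the console, and the helper name try_write is hypothetical:

```python
import io

def try_write(text, encoding):
    """Write text through a TextIOWrapper that uses the given encoding.

    Returns the encoded bytes on success, or None if the encoding
    cannot represent some character in text.
    """
    raw = io.BytesIO()  # stands in for the console (assumption)
    stream = io.TextIOWrapper(raw, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return None
    return raw.getvalue()

chinese = u"\u4e2d\u6587"  # the two characters of "Chinese (language)"
ascii_result = try_write(chinese, "ascii")  # ascii cannot represent them
gbk_result = try_write(chinese, "gbk")      # gbk can
```

With an ascii wrapper the write fails, while a gbk wrapper accepts the same text, which is the situation the quote describes for sys.stdout.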

I don't know whether you had the same thought here as I did: can we set the console encoding to one that understands Chinese characters? Wait, let's spend a while on Google first.

Python determines the encoding of stdout and stderr based on the value of the LC_CTYPE variable, but only if the stdout is a tty. So if I just output to the terminal, LC_CTYPE (or LC_ALL) define the encoding. However, when the output is piped to a file or to a different process, the encoding is not defined, and defaults to 7-bit ASCII.

The details are as follows:

1). When Python finds its output attached to a terminal, it sets the sys.stdout.encoding attribute to the terminal's encoding. The print statement's handler will automatically encode unicode arguments into str output.

2). When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the "ascii" codec.
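Both cases can be inspected directly. A minimal sketch, assuming Python 3, where sys.stdout.encoding is normally a string even when the output is piped (unlike the Python 2 behavior quoted above). The helper name describe_stream is hypothetical:

```python
import io
import sys

def describe_stream(stream):
    """Return (is_tty, encoding) for a file-like object."""
    is_tty = bool(getattr(stream, "isatty", lambda: False)())
    encoding = getattr(stream, "encoding", None)
    return is_tty, encoding

# Real standard output: the result depends on how the script is run
# (terminal vs. pipe) and on the locale settings.
stdout_info = describe_stream(sys.stdout)

# An in-memory text buffer is never a terminal.
buffer_info = describe_stream(io.StringIO())

print(stdout_info, buffer_info)
```

Running the script once in a terminal and once with its output redirected to a file is a quick way to compare the two situations.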

Well, the idea just now does not seem very elegant, because it requires changing system settings. In fact, the discussion above is based on a Linux environment, where we may need to change the value of an environment variable (LC_CTYPE or LANG). On Windows, the console encoding is tied to the operating system's regional settings; for example, on Chinese Windows 7 the default console encoding is GBK (cp936). You can try the following code:

import locale
print locale.getdefaultlocale()[1]
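On Python 3 a similar check can be written with locale.getpreferredencoding, since locale.getdefaultlocale is deprecated as of Python 3.11. A small sketch: the preferred encoding depends on the machine, and the codecs.lookup calls only show that Python treats cp936 as an alias of its gbk codec:

```python
import codecs
import locale

# Encoding the current locale prefers for text data; machine-dependent.
preferred = locale.getpreferredencoding(False)

# codecs.lookup resolves aliases to the codec's canonical name.
cp936_name = codecs.lookup("cp936").name
gbk_name = codecs.lookup("gbk").name

print(preferred, cp936_name, gbk_name)
```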

If the console encoding is not set appropriately, can we set sys.stdout.encoding ourselves to achieve our goal? Unfortunately, the answer is no: this attribute is read-only.
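The read-only behavior is easy to confirm. A minimal sketch in Python 3, using an in-memory io.TextIOWrapper in place of the real sys.stdout so that the actual console is left untouched. The helper name can_set_encoding is hypothetical:

```python
import io

def can_set_encoding(stream, new_encoding):
    """Try to assign stream.encoding; report whether it succeeded."""
    try:
        stream.encoding = new_encoding
    except (AttributeError, TypeError):
        # Text streams expose encoding as a read-only attribute.
        return False
    return True

stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
was_set = can_set_encoding(stream, "gbk")
```

The assignment is rejected and the stream keeps the encoding it was created with.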

Is there no way out? Not quite. Actually, we are very close to success. Let's sort out the information gathered above and see what we know now:

1). The console cannot display Chinese characters correctly. On Windows, the console encoding is determined by the operating system;

2). My operating system is the Chinese version of Windows 7 (GBK): enc = locale.getdefaultlocale()[1];

3). The console encoding determines the value of sys.stdout.encoding; here sys.stdout.encoding = UTF-8;

4). The strings returned when the operating system enumerates the directory (E:\Torchlight II) are also GBK-encoded.

Do you see the problem? The strange question marks in the output appear because the strings are encoded as GBK, while sys.stdout.encoding = UTF-8. As a result, when print sends the data to the console, the bytes are interpreted as UTF-8 in order to turn them into Unicode characters. That, of course, is wrong. Now that the reason is clear, let's modify the code:

import os
# Python 2: decode each GBK byte string into unicode before printing
for i in os.listdir(r"E:\Torchlight II"):
    print i.decode('gbk')

In this code we explicitly tell Python to decode the strings it reads as GBK. After this step the data is a proper unicode object, and print can output it correctly (even though sys.stdout.encoding = UTF-8).
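The mismatch and the fix can be reproduced with plain byte strings. A minimal sketch in Python 3, where the sample file name and the variable names are made up for illustration. In Python 3, os.listdir called with a str path already returns decoded text, so the bytes are built by hand here to imitate what the Python 2 call returned:

```python
# A made-up file name, as the GBK bytes that Python 2's os.listdir
# would have returned on a GBK system.
name = u"\u4e2d\u6587.txt"
gbk_bytes = name.encode("gbk")

# Wrong: reading GBK bytes as UTF-8 produces replacement characters.
misread = gbk_bytes.decode("utf-8", errors="replace")

# Right: decode with the encoding the bytes were actually written in.
decoded = gbk_bytes.decode("gbk")
```

Only the second decode recovers the original name; the first one yields U+FFFD replacement characters in place of the Chinese characters.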

 

Ps:

In fact, Google also turns up many similar encoding problems, such as here and here. The problems look endlessly varied and come with all kinds of solutions, some of them specific to Python itself (for example, here). However, they all come down to the encoding and decoding of characters. Once that is clear, all of them can be solved.

I will provide several valuable references:

Unicode HOWTO: http://docs.python.org/howto/unicode.html#history-of-character-codes

Unicode In Python, Completely Demystified: http://farmdev.com/talks/unicode/

Python, Unicode and UnicodeDecodeError: http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets: http://www.joelonsoftware.com/articles/Unicode.html

Source: http://www.cnblogs.com/pinopino/archive/2012/10/04/2711347.html
