A Guide to Python Encoding Pitfalls

Source: Internet
Author: User

I have been learning Python recently. It is a very concise language, backed by a powerful standard library. Soon after starting, though, I ran into a headache: encoding. I searched through a lot of material online and am summarizing it here, partly as a record for myself and partly to help those who come after me. If it saves you a few detours, I will be honored.

Let's start by describing the phenomenon:

# Python 2: list a directory and print each entry
import os

for i in os.listdir(r"E:\Torchlight II"):
    print i

The code is simple: we use the os module's listdir function to list the E:\Torchlight II directory (Torchlight?! :)). Because some of the files in this directory have Chinese names, the printed result came out garbled.

So where is the problem? Don't worry, we'll analyze it step by step.

From here and here we can be almost certain where the problem lies:

This means that the Python console app can't write the given character to the console's encoding. More specifically, the Python console app created a _io.TextIOWrapper instance with an encoding that cannot represent the given character.

sys.stdout --> _io.TextIOWrapper --> (your console)
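The situation described in the quote can be reproduced in isolation. The following is a minimal sketch, written in Python 3 syntax (an assumption; the examples in this article are Python 2). It wraps an in-memory buffer, which stands in for the console, in an io.TextIOWrapper and tries to write a Chinese string through it. The helper name try_write and the sample text are made up for illustration.

```python
import io

def try_write(text, encoding):
    """Write text through a TextIOWrapper that uses the given encoding.

    Returns the encoded bytes on success, or None if the encoding
    cannot represent the text.
    """
    raw = io.BytesIO()  # stands in for the console
    stream = io.TextIOWrapper(raw, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return None
    return raw.getvalue()

sample = "\u4e2d\u6587"  # "中文", a made-up sample string
print(try_write(sample, "ascii"))  # None: ASCII cannot represent it
print(try_write(sample, "gbk"))    # the GBK-encoded bytes
```

With an ASCII wrapper the write fails, while a GBK wrapper accepts the same text, which is the mismatch the quote describes.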

Reading this, I wonder whether you had the same thought I did: could we set the console encoding to one that understands Chinese characters, so that they display properly? Hold on, let's spend a little more time on Google:

Python determines the encoding of stdout and stderr based on the value of the LC_CTYPE variable, but only if the stdout is a TTY. So if I just output to the terminal, LC_CTYPE (or LC_ALL) define the encoding. However, when the output is piped to a file or to a different process, the encoding is not defined, and defaults to 7-bit ASCII.

A more detailed explanation is as follows:

1). When Python finds its output attached to a terminal, it sets the sys.stdout.encoding attribute to the terminal's encoding. The print statement's handler will automatically encode unicode arguments into str output.

2). When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the "ascii" codec.
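To see which of the two cases applies on a given machine, the relevant attributes can be inspected directly. This is a minimal sketch in Python 3 syntax (an assumption: the quotes above describe Python 2, and Python 3 generally falls back to a locale-derived encoding rather than ASCII when output is piped, so the values may differ). The helper name describe_stream is hypothetical.

```python
import sys

def describe_stream(stream):
    """Report whether a stream is a terminal and which encoding it uses."""
    # Some replacement streams lack isatty() or encoding, so fall back safely.
    isatty = getattr(stream, "isatty", None)
    return {
        "is_tty": bool(isatty()) if callable(isatty) else False,
        "encoding": getattr(stream, "encoding", None),
    }

print(describe_stream(sys.stdout))
```

Running the script once directly in a terminal and once with its output redirected to a file shows whether the two situations report different values.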

Well, the idea looks feasible, just not very elegant, because we would have to modify system settings. In fact, the discussion above is based on a Linux environment, where we may need to change the value of an environment variable (LC_CTYPE or LANG). Under Windows, the console's encoding is tied to the operating system's regional settings. For example, on Chinese Windows 7 the default console encoding is GBK (cp936). You can try the following code:

# Python 2: print the encoding of the default locale
import locale

print locale.getdefaultlocale()[1]

Since the console encoding is hard to change, can we set sys.stdout.encoding instead to achieve our goal? Unfortunately, the answer is no: this attribute is read-only.
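The read-only behaviour is easy to confirm. The sketch below uses Python 3 syntax (an assumption, since the article targets Python 2) and an in-memory io.TextIOWrapper in place of the real sys.stdout, so that it behaves the same wherever it runs. The helper name can_set_encoding is hypothetical.

```python
import io

def can_set_encoding(stream, new_encoding):
    """Return True if assigning stream.encoding succeeds, False otherwise."""
    try:
        stream.encoding = new_encoding
    except (AttributeError, TypeError):
        # Text streams expose encoding as a read-only attribute.
        return False
    return True

stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
print(can_set_encoding(stream, "gbk"))  # False
print(stream.encoding)                  # ascii
```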

Is there no way out? No, we are actually very close to success. Based on the material gathered above, let's review what we know so far:

1). The console does not display Chinese correctly; the console encoding is determined by the operating system (on Windows);

2). My operating system is the Chinese version of Win7 (GBK): enc = locale.getdefaultlocale()[1];

3). The console's encoding determines the value of sys.stdout.encoding; here sys.stdout.encoding = utf-8;

4). The strings returned by the operating system when listing E:\Torchlight II are also GBK-encoded.

Do you see the problem? The strange question marks in the output at the top appear because the strings themselves are GBK-encoded, but since sys.stdout.encoding = utf-8, print handles the input bytes as if they were UTF-8 when turning them into Unicode characters. That is, of course, wrong. With the cause clear, let's change the code:
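The mismatch can be reproduced without a Chinese Windows console. This is a minimal sketch in Python 3 syntax (an assumption; a Python 2 str corresponds to bytes here). A name is encoded as GBK, the way the author's system returned it, and is then interpreted as UTF-8. The sample text is made up.

```python
name = "\u4e2d\u6587"            # "中文", a made-up file name
gbk_bytes = name.encode("gbk")   # what a GBK system hands back

# Interpreting GBK bytes as UTF-8: invalid sequences become U+FFFD,
# which is typically displayed as a question mark or a box.
misread = gbk_bytes.decode("utf-8", errors="replace")

# Interpreting them with the encoding they were written in.
correct = gbk_bytes.decode("gbk")

print(repr(misread))
print(correct == name)  # True
```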

# Python 2: decode each GBK-encoded name to Unicode before printing
import os

for i in os.listdir(r"E:\Torchlight II"):
    print i.decode('gbk')

In this code we explicitly tell Python to decode the strings it reads as GBK. After that, the data is a standard Unicode string, and we can rely on print to output it correctly (even when sys.stdout.encoding = utf-8).
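The same fix generalises: decode bytes to Unicode as soon as they enter the program, using the encoding they were written in, and let the output layer encode them again. This is a sketch in Python 3 syntax (an assumption). The helper decode_names and its default encoding are hypothetical, and the byte strings stand in for what os.listdir returned in the author's Python 2 session.

```python
def decode_names(raw_names, encoding="gbk"):
    """Decode a list of byte-string file names into text."""
    decoded = []
    for raw in raw_names:
        if isinstance(raw, bytes):
            # errors="replace" keeps the loop going if a name is not valid
            # in the assumed encoding.
            decoded.append(raw.decode(encoding, errors="replace"))
        else:
            decoded.append(raw)  # already text, nothing to do
    return decoded

raw_names = ["\u4e2d\u6587.txt".encode("gbk"), b"readme.txt"]
names = decode_names(raw_names)

# Once decoded, the text can be encoded for any target that supports it.
for n in names:
    print(n.encode("utf-8"))
```

Decoding once at the boundary means the rest of the program only ever handles text, regardless of what encoding the console or an output file expects.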

PS:

In fact, Google turns up many similar encoding problems, such as here and here. The problems vary endlessly and so do the solutions, some of them even specific to Python itself, such as here. But in essence they are all about character encoding and decoding; once you understand that, all of them can be solved.

Here are a few references that I think are valuable:

http://docs.python.org/howto/unicode.html#history-of-character-codes Unicode HOWTO

http://farmdev.com/talks/unicode/ Unicode in Python, Completely Demystified

http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror Python, Unicode and UnicodeDecodeError

http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

Source: http://www.cnblogs.com/pinopino/archive/2012/10/04/2711347.html
