Everything You Did Not Want to Know About Unicode in Python 3


Last Update:2017-01-19 Source: Internet

Author: User



My readers know that I like to complain about Unicode in Python 3. This time is no exception. I will tell you how painful Unicode is to use and why I can't shut up about it. I spent two weeks working with Python 3 and I need to vent my disappointment. There is still useful information in this rant, because it teaches us how to deal with Python 3. If that does not bother you, read on.

The content of this rant is different this time. It has nothing to do with WSGI, HTTP or any of their related objects. I am usually told to stop complaining about the Python 3 Unicode system because I don't write the kind of code people commonly write (HTTP libraries and the like), so this time I am going to write something else: a command-line application. I wrote a very handy library called Click to make that easier.

Note that I am doing what every novice Python programmer does: writing a command-line application. A Hello World program. But unlike before, I wanted to make sure the application was stable, supported Unicode on both Python 2 and Python 3, and could be unit tested. So here is how to implement it.

What do we want to do?

As developers, we want to use Unicode well in Python 3. I take this to mean that all text data is Unicode and all non-text data is bytes. In such a wonderful world everything is black and white, and the Hello World example is very straightforward. So let's write some shell tools.

Here is the application written for Python 2:

import sys
import shutil

for filename in sys.argv[1:]:
    f = sys.stdin
    if filename != '-':
        try:
            f = open(filename, 'rb')
        except IOError as err:
            print >> sys.stderr, 'cat.py: %s: %s' % (filename, err)
            continue
    with f:
        shutil.copyfileobj(f, sys.stdout)

Obviously, this command is not particularly good at handling command-line options, but it is at least usable. So let's start coding.

Unicode in Unix

The code above is simple in Python 2 because you are implicitly dealing with bytes everywhere. The command-line arguments are bytes, the filenames are bytes, and the file contents are bytes. Purists will point out that this is wrong and can cause problems, but if you think about it more, you will find that it is an unfixable problem.

UNIX is bytes. It has been defined that way and always will be. To understand why, you need to look at the different contexts that data passes through:

Terminal
Command line arguments
Operating system input and output layer
File System Driver

By the way, these are not the only things the data might pass through, but let's go with them for now. In how many of these contexts do we know the encoding? The answer is: in none of them. The closest we get to knowing an encoding is the locale information the terminal exports. This information can be used to show translations and also to work out which encoding text has.

For example, an LC_CTYPE value of en_US.utf-8 tells the application that the system uses US English and that most text data is UTF-8 encoded. There are actually many other variables, but let's assume this is the only one we need to look at. Note that LC_CTYPE does not mean that all data is UTF-8 encoded. Instead, it tells the application how to classify text characters and when it needs to apply conversions.
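As a rough illustration of how an application might work out which locale value applies, the sketch below follows the POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG). The helper names effective_ctype and split_locale are made up for this example and are not part of any library.

```python
import os


def effective_ctype(environ=None):
    """Return the locale string that governs character classification.

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Unset or empty values fall through; with nothing set the result is "C".
    """
    env = os.environ if environ is None else environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


def split_locale(value):
    """Split 'en_US.utf-8' into ('en_US', 'utf-8'); the codeset may be None."""
    value = value.split("@", 1)[0]  # drop a modifier such as '@euro'
    name, _, codeset = value.partition(".")
    return name, (codeset or None)


chosen = effective_ctype({"LANG": "de_DE.UTF-8", "LC_CTYPE": "en_US.utf-8"})
print(chosen, split_locale(chosen))
```

Even when a codeset is present, it is only a hint about how text is likely encoded. It says nothing definite about any particular file or filename.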

This matters because of the C locale. The C locale is the only locale that POSIX actually specifies. It says that the encoding is ASCII and that all responses from command-line tools are as defined in the POSIX spec.

In the cat tool above, there is no way to treat the data other than as bytes. The reason is that nothing in the shell indicates what the data is. For example, if you invoke cat hello.txt, the terminal passes hello.txt to your application encoded in the terminal's encoding.

But now consider echo *. The shell passes all the filenames in the current directory to your application. So what encoding are they in? Filenames have no encoding!
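Python exposes this byte-level view directly: the os functions accept a bytes path and then return names as bytes, with no decoding involved. Here is a small sketch using a throwaway temporary directory; the helper name list_names_both_ways is hypothetical.

```python
import os
import tempfile


def list_names_both_ways(directory):
    """Return (names_as_bytes, names_as_str) for one directory."""
    raw = sorted(os.listdir(os.fsencode(directory)))  # bytes in, bytes out
    text = sorted(os.listdir(directory))              # str in, str out (decoded by Python)
    return raw, text


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "hello.txt"), "wb").close()
    raw_names, text_names = list_names_both_ways(tmp)
    print(raw_names, text_names)
```

The bytes form is what the operating system actually stores. The str form is Python's interpretation of it.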

Unicode Madness

Now a Windows person will probably say, "What on earth are the UNIX people doing?" But it is not that dire. All of this works because some clever people designed the system to be backwards compatible. Unlike Windows, which defines every API twice, the best approach on POSIX is to treat everything as bytes and, for display purposes only, decode it using the default encoding.

Take the cat command above as an illustration. It prints an error message for files that cannot be opened because they do not exist, are protected, or for any other reason. Let's assume the filename is encoded in latin1 because it came from an external drive from 1995. The terminal receives our output and tries to decode it as UTF-8, because that is what it thinks it is working with. Because the string is latin1, it does not decode cleanly. But fear not, nothing crashes: your terminal simply ignores what it cannot handle.

What about graphical interfaces? They keep two versions of each name. When a GUI such as Nautilus lists all files, it associates the raw filename with the icon you can double-click, and separately it tries to produce a name it can display by decoding it. For example, it will try to decode it as UTF-8 and replace the parts that fail with question marks. Your filename may not be fully readable, but you can still open the file.

Unicode on UNIX is only madness if you force it on everything. But that is not how Unicode works on UNIX. UNIX has no API that distinguishes between Unicode and bytes. They are one and the same, which makes them easy to handle.
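The display-only decoding described above can be sketched in a few lines of Python. The function name display_name is invented for this example, and the latin1 filename stands in for the file from the old drive.

```python
def display_name(raw_name, encoding="utf-8"):
    """Decode a byte filename for display only; undecodable bytes become U+FFFD."""
    return raw_name.decode(encoding, "replace")


raw = "café.txt".encode("latin1")  # b'caf\xe9.txt', which is not valid UTF-8
label = display_name(raw)          # what the user sees; the bad byte is replaced

# The raw bytes stay untouched and are still what you would pass to open().
print(repr(raw), repr(label))
```

The label is lossy and the raw name is not, so keeping both is what lets a file stay openable even when its name cannot be shown correctly.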

C Locale

Nowhere does this show up more than with the C locale. The C locale is the POSIX specification's escape hatch for making everything behave the same way. A POSIX-compliant operating system needs to support setting LC_CTYPE to C, which forces everything to use ASCII.

This locale is picked in a number of different situations. You will mainly find it in programs launched from cron, from your init system, and in subprocesses started with an empty environment. The C locale restores a sane ASCII land in environments where you otherwise could not trust anything.

But the word ASCII implies a 7-bit encoding. This is not a problem, because the operating system deals in bytes! Any 8-bit content passes through just fine, but you follow a contract with the operating system that character processing is limited to the first 7 bits. Any message your tool generates will also be ASCII, and the language will be English.

Note that the POSIX specification does not say that your application should die in flames.
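Python's own bytes type happens to follow the same contract, which makes for a convenient illustration: its character methods only touch ASCII, and every other byte passes through unchanged. This is an analogy in Python, not a demonstration of what the C library does, and is_seven_bit is a helper invented for the example.

```python
def is_seven_bit(blob):
    """True if every byte fits into 7 bits, i.e. the data is plain ASCII."""
    return all(byte < 0x80 for byte in blob)


data = "Grüße\n".encode("utf-8")  # contains bytes above 0x7F

# Character processing affects only the ASCII letters; the multi-byte
# sequences for 'ü' and 'ß' are passed through byte for byte.
shouted = data.upper()

print(is_seven_bit(data), shouted)
```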

Python 3 Dies in Flames

Python 3 takes a very different stance on Unicode than UNIX does. Python 3 says: everything is Unicode (by default, except in certain situations, and except when we send you re-encoded data, and even then it is sometimes still Unicode, albeit wrong Unicode). Filenames are Unicode, terminals are Unicode, stdin and stdout are Unicode; there is so much Unicode! And because UNIX is not Unicode, Python 3 now takes the position that it is right and UNIX is wrong, and that people should change the POSIX specification to add Unicode. Then filenames would be Unicode and terminals would be Unicode, and you would never again see errors caused by bytes.

I am not the only one saying this. These are bugs in Python caused by this brain-dead approach to Unicode:

ASCII is a bad filesystem default encoding
Use surrogateescape as default error handler
Python 3 raises Unicode errors in the C locale
LC_CTYPE=C: pydoc leaves the terminal in an unusable state

If you Google around, you will find many more complaints. See how many people failed to install pip packages because of some characters in a changelog, or because of their home folder, or because their SSH session is ASCII, or because they are connecting with PuTTY.

Python 3 Cat

Now let's fix cat for Python 3. How do we do that? First, we need to deal with bytes, because someone might echo something that does not match the encoding the shell reports. So at the very least, the file contents need to be bytes. But we also need to open the standard output so that it supports bytes, which it does not do by default. We also need to handle separately the case where the Unicode APIs fail because the encoding is C. So here is a feature-compatible cat for Python 3:

import sys
import shutil

def _is_binary_reader(stream, default=False):
    try:
        return isinstance(stream.read(0), bytes)
    except Exception:
        return default

def _is_binary_writer(stream, default=False):
    try:
        stream.write(b'')
    except Exception:
        try:
            stream.write('')
            return False
        except Exception:
            pass
        return default
    return True

def get_binary_stdin():
    # sys.stdin might or might not be binary in some extra cases.  By
    # default it's obviously non binary which is the core of the
    # problem but the docs recommend changing it to binary for such
    # cases so we need to deal with it.  Also someone might put
    # StringIO there for testing.
    is_binary = _is_binary_reader(sys.stdin, False)
    if is_binary:
        return sys.stdin
    buf = getattr(sys.stdin, 'buffer', None)
    if buf is not None and _is_binary_reader(buf, True):
        return buf
    raise RuntimeError('Did not manage to get binary stdin')

def get_binary_stdout():
    if _is_binary_writer(sys.stdout, False):
        return sys.stdout
    buf = getattr(sys.stdout, 'buffer', None)
    if buf is not None and _is_binary_writer(buf, True):
        return buf
    raise RuntimeError('Did not manage to get binary stdout')

def filename_to_ui(value):
    # The bytes branch is unnecessary for *this* script but otherwise
    # necessary as Python 3 still supports addressing files by bytes
    # through separate APIs.
    if isinstance(value, bytes):
        value = value.decode(sys.getfilesystemencoding(), 'replace')
    else:
        value = value.encode('utf-8', 'surrogateescape') \
            .decode('utf-8', 'replace')
    return value

binary_stdout = get_binary_stdout()
for filename in sys.argv[1:]:
    if filename != '-':
        try:
            f = open(filename, 'rb')
        except IOError as err:
            print('cat.py: %s: %s' % (filename_to_ui(filename), err),
                  file=sys.stderr)
            continue
    else:
        f = get_binary_stdin()

    with f:
        shutil.copyfileobj(f, binary_stdout)

This is not the worst version. It is not that I want to make things complicated; it really is this complicated now. For example, one thing the example does not do is forcibly flush the text stdout before accessing it in binary mode. That is not necessary here because the print call goes to stderr rather than stdout, but if you wanted to print something to stdout, you would have to flush it. Why? Because stdout is a buffer on top of another buffer, and your output may come out in the wrong order if you do not flush it.

It is not just me, either. Twisted's compat module, for example, runs into the same trouble.
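To make the ordering problem concrete: text written through the text layer can sit in that layer's buffer until it is flushed, so bytes written directly to the underlying buffer can overtake it. The sketch below uses an in-memory stand-in for sys.stdout; the function name write_text_then_bytes is hypothetical.

```python
import io


def write_text_then_bytes(text_stream, text, payload):
    """Write text, then raw bytes, to the same underlying stream in order."""
    text_stream.write(text)
    text_stream.flush()                # push pending text down first
    text_stream.buffer.write(payload)  # now it is safe to bypass the text layer
    text_stream.buffer.flush()


# Stand-in for sys.stdout: a text wrapper on top of a buffered byte stream.
sink = io.BytesIO()
fake_stdout = io.TextIOWrapper(io.BufferedWriter(sink), encoding="utf-8", newline="\n")
write_text_then_bytes(fake_stdout, "header\n", b"\xff\xfe raw\n")
print(sink.getvalue())
```

With the real sys.stdout the same pattern applies: flush it before touching sys.stdout.buffer.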

The Encoding Dance

To follow the journey of a command-line argument from the shell, here is what happens in Python 3 in the worst case:

The shell passes the filename to the script as bytes.
Python decodes the bytes using the expected encoding before they ever reach your code. Because this is a lossy process, Python 3 uses a special error handler that stores the decoding errors in the string as surrogates.
The Python code then hits a file-not-found error and needs to format an error message. Because we are writing to a text stream, we cannot write the surrogates out, as they are not valid Unicode.
Instead, we encode the Unicode string containing the surrogates as UTF-8 and tell the encoder to handle the surrogate escapes.
Then we decode it from UTF-8 again and tell the decoder to replace the errors.
The resulting string goes back out to our text-only stream (stderr).
Finally, the terminal decodes our string to display it.
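The round trip performed by filename_to_ui can be traced by hand. In this sketch, raw_arg is a made-up latin1 filename and utf-8 is assumed as the expected encoding.

```python
raw_arg = b"caf\xe9.txt"  # bytes as the shell would pass them (latin1, not UTF-8)

# Decoding with surrogateescape smuggles the undecodable byte 0xE9
# into the string as the lone surrogate U+DCE9.
decoded = raw_arg.decode("utf-8", "surrogateescape")

# A strict encoder, such as the one behind a text stream, rejects it.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with surrogateescape restores the original bytes exactly ...
restored = decoded.encode("utf-8", "surrogateescape")

# ... and decoding those with 'replace' yields something printable.
printable = restored.decode("utf-8", "replace")

print(repr(decoded), strict_ok, repr(printable))
```

The original bytes survive the whole dance; only the printable form is lossy.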

Here is what happens in Python 2:

The shell passes the filename to the script as bytes.
The shell decodes the string for display.

Because the Python 2 version does no string processing along the way, it is just as correct, if not more so, because the shell can do a better job of displaying the filename.

Note that this does not make the script any less correct. If you need to do actual string processing on the input data, you would switch to Unicode processing on both 2.x and 3.x. But in that case you would also want your script to support a --charset parameter, so the work on 2.x and 3.x is about the same. It is just worse on 3.x, because you need to construct a binary standard output that you don't need on 2.x.
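A script that does real string processing might expose the encoding along these lines. This is only a sketch: the option name --charset comes from the text above, while build_parser, decode_input and the sample arguments are invented for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="cat.py")
    parser.add_argument("--charset", default="utf-8",
                        help="encoding used to decode input for text processing")
    parser.add_argument("files", nargs="*")
    return parser


def decode_input(data, charset):
    """Decode raw input bytes only at the point where text is really needed."""
    return data.decode(charset)


args = build_parser().parse_args(["--charset", "latin1", "notes.txt"])
text = decode_input(b"caf\xe9", args.charset)
print(args.files, text)
```

The input is still read as bytes. The charset is applied explicitly, so the result does not depend on the locale the script happens to run under.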

But You Are Wrong

Obviously I am wrong. Here is what I have been told:

I only find it painful because I don't think like a beginner, and the new Unicode system is much friendlier to beginners.
I don't consider Windows users and how much of an improvement the new text model is for them.
The problem is not Python, the problem is the POSIX specification.
Linux distributions need to start supporting C.UTF-8 because they are stuck in the past.
The problem is that SSH passes the wrong encoding. This needs to be fixed in SSH.
The real problem behind the many Unicode bugs in Python 3 is that people don't pass explicit encodings and instead assume that Python 3 makes the right decision.
I work with boundary code, which is obviously harder in Python 3.
I should spend my time improving Python 3 instead of complaining on Twitter and my blog.
You are making a problem where there is none. Just have everyone fix their environment and encodings everywhere and everything is fine. This is a user problem.
Java has had this problem for years, and it has been no problem for developers.

You know what? I did stop complaining while I was working on the HTTP side, because I accepted the idea that many of the HTTP/WSGI problems are not something ordinary people run into. But you know what? The same problem shows up in a simple Hello World scenario. Maybe I should give up on achieving high-quality Unicode support in my libraries and leave it at that.

I could refute each of the arguments above, but in the end it does not matter. If Python 3 were the only Python language available to me, I would swallow all the problems and develop with it. But it is not: there is a perfectly good other language called Python 2, which has a larger user base, and that user base is staying firmly put. At the moment this is very frustrating.

Python 3 may become big enough to start forcing UNIX down the Windows route, with Unicode enforced in many places, but I doubt it.

The more likely outcome is that people stick with Python 2, or build broken things on Python 3. Or they will use Go. That language uses a model similar to Python 2: everything is a byte string, and the encoding is assumed to be UTF-8. End of story.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
