Everything you did not want to know about Unicode in Python 3

Last Update:2016-06-10 Source: Internet

Author: User


My readers know that I like to complain about Unicode in Python 3. This time is no exception. I will tell you how painful Unicode is to use there and why I can't shut up about it. I spent two weeks working with Python 3 and I need to vent my disappointment. There is still useful information in this rant, because it teaches us how to deal with Python 3. If I have not annoyed you yet, read on.

This rant is going to be different: it has nothing to do with WSGI, HTTP, or anything related. I am usually told that I should stop complaining about the Python 3 Unicode system because I don't write the kind of code people normally write (HTTP libraries and the like), so this time I wrote something else: a command-line application. I also wrote a handy library called Click to make such applications easier to write.

Note that what I am doing is what every novice Python programmer does: writing a command-line application. The Hello World program. But unlike in the past, I wanted to make sure the application was stable and supported Unicode on both Python 2 and Python 3, with unit tests. What follows is how to implement it.

What we want to do

In Python 3 we as developers are supposed to use Unicode properly. As I understand it, this means that all text data is Unicode and all non-text data is bytes. In this wonderful world everything is black and white, and the Hello World example is very straightforward. So let's write some shell tools.

This is the application as implemented for Python 2:

    import sys
    import shutil

    for filename in sys.argv[1:]:
        f = sys.stdin
        if filename != '-':
            try:
                f = open(filename, 'rb')
            except IOError as err:
                print >> sys.stderr, 'cat.py: %s: %s' % (filename, err)
                continue
        with f:
            shutil.copyfileobj(f, sys.stdout)

Obviously the command does not handle any command-line options, but at least it roughly works. So that is what we start with.

Unicode in Unix

The code above is simple in Python 2 because you are implicitly working with bytes everywhere. Command-line arguments are bytes, file names are bytes, and file contents are bytes. Language purists will point out that this is not correct and can cause problems, but the more you think about it, the more you will find that this is a problem that cannot be fixed.

UNIX is bytes. It has been defined that way and always will be. To understand why, you need to look at the different contexts that data passes through:

Terminal
Command-line arguments
Operating system input and output layer
File System Driver

These are, by the way, not the only places data may pass through, but let's go with them. In how many of these contexts do we know the encoding? The answer is: none. The closest we get to knowing an encoding is the locale information exported by the terminal. This information can be used to display translations and to understand what encoding text has.

For example, an LC_CTYPE value of en_US.UTF-8 tells an application that the system uses US English and that most text data is UTF-8 encoded. There are actually many other variables, but let's assume this is the only one we need to look at. Note that LC_CTYPE does not mean that all data is UTF-8 encoded. Instead, it tells the application how to classify characters and which conversions to apply when needed.
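
As a rough sketch of how a program might discover that hint, the snippet below looks up the locale variables in the usual POSIX order of precedence (LC_ALL, then LC_CTYPE, then LANG) and asks Python which encoding it derives from the environment. The helper name effective_ctype is made up for illustration.

```python
import locale
import os


def effective_ctype(environ=None):
    """Return (variable, value) for the setting that decides LC_CTYPE.

    effective_ctype is a hypothetical helper, not part of any library.
    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    """
    if environ is None:
        environ = os.environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:
            return name, value
    # Nothing set: POSIX falls back to the C locale.
    return None, "C"


if __name__ == "__main__":
    print(effective_ctype())
    # The encoding Python itself derives from the environment.
    print(locale.getpreferredencoding(False))
```

Either way, the result is only a hint about text meant for display. It says nothing about the bytes in a file or a file name.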

This matters because of the C locale. The C locale is the only locale that POSIX actually specifies, and it says that the encoding is ASCII and that all responses from command-line tools are as defined in the POSIX spec.

In the case of our cat tool above, there is no way to treat the data other than as bytes. The reason is that nothing on the shell indicates what the data is. For example, if you invoke cat hello.txt, the terminal passes hello.txt to the application encoded in the terminal's encoding.

But now consider echo *. The shell passes all the file names in the current directory to your application. So which encoding are they in? File names have no encoding!
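
Python 3 still lets you see file names the way the operating system hands them over. A minimal sketch, assuming a POSIX system: passing a bytes path to os.listdir returns the names as raw bytes, with no decoding attempted. The directory and file name here are created only for the demonstration, and list_raw_names is a made-up helper.

```python
import os
import tempfile


def list_raw_names(directory):
    """List a directory and return its entries as undecoded bytes."""
    # A bytes argument makes os.listdir return bytes entries.
    return sorted(os.listdir(os.fsencode(directory)))


with tempfile.TemporaryDirectory() as tmp:
    # Throwaway file so the listing has something to show.
    open(os.path.join(tmp, "hello.txt"), "wb").close()
    raw_names = list_raw_names(tmp)
    print(raw_names)
```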

Unicode Madness

A Windows user reading this will probably say: what on earth are the UNIX people doing? But it is not that dire. It all works because some smart people designed the system to be backwards compatible. Unlike Windows, which defines every API twice, the best way to deal with POSIX is to assume everything is bytes and, for display purposes only, decode it using the default encoding.

Take the cat command above. It prints an error message for files it cannot open, whether because they do not exist, because they are protected, or for any other reason. Let's assume the file name is encoded in latin1 because it came from an external drive from 1995. The terminal receives our output and tries to decode it as UTF-8, because that is the encoding it believes it is working with. Since the string is latin1, it does not decode correctly. But fear not, nothing crashes, because your terminal simply ignores what it cannot handle.

What does this look like in graphical interfaces? They keep two versions of each name. When a GUI such as Nautilus lists files, it associates the raw file name with the icon so that double-clicking works, and it separately tries to produce a name for display by decoding it. For example, it tries to decode as UTF-8 and substitutes a replacement character wherever decoding fails. Your file name may not be fully readable, but you can still open the file.
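
The two-versions idea can be sketched in Python. The byte string below stands in for a latin1 file name from that 1995 drive (the name itself is invented): the raw bytes are kept for opening the file, and a separate display string is produced by decoding as UTF-8 with the 'replace' error handler. This is an illustration only, not a description of how Nautilus is implemented.

```python
def display_name(raw_name, encoding="utf-8"):
    """Decode a raw file name for display only; never fails."""
    # Undecodable bytes become U+FFFD (the replacement character).
    return raw_name.decode(encoding, "replace")


raw = "caf\u00e9.txt".encode("latin-1")   # b'caf\xe9.txt', not valid UTF-8

try:
    raw.decode("utf-8")                   # strict decoding fails
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

label = display_name(raw)                 # safe to show; raw is still what you open
print(strict_ok, repr(label))
```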

Unicode on UNIX is only madness if you force it onto everything. But that is not how Unicode works on UNIX. UNIX has no distinction between Unicode and byte APIs. They are one and the same, which makes them easy to deal with.

C Locale

Nowhere does this show up more than with the C locale. The C locale is the escape hatch in the POSIX specification that forces everything to behave the same way. A POSIX-compliant operating system needs to support setting LC_CTYPE to C, which makes everything use ASCII.

This locale is picked in a number of situations. You will mainly find it in programs launched from cron, in your init system, and in child processes started with an empty environment. The C locale restores a sane ASCII world in environments where you otherwise could not trust anything.

But ASCII means a 7-bit encoding. This is not a problem, because the operating system deals in bytes! Any 8-bit content works fine, but if you follow the operating system's conventions, character processing is limited to the first 7 bits. Any messages your tool generates will be encoded in ASCII and written in English.

Note that the POSIX specification does not say that your application should die in flames.

Python 3 Dies in Flames

Python 3 takes a very different stance on Unicode than UNIX does. Python 3 says: everything is Unicode (by default, except in certain situations, and except when we hand you re-encoded data, and even then it is sometimes still Unicode, albeit wrong Unicode). File names are Unicode, terminals are Unicode, stdin and stdout are Unicode, there is so much Unicode. And because UNIX is not Unicode, Python 3 now takes the position that it is right and UNIX is wrong, and that people should change the POSIX specification to add Unicode. Then file names would be Unicode and terminals would be Unicode, and you would no longer see errors caused by bytes.

And it is not just me saying this. These are bugs caused by Python's misguided idea of how to do Unicode:

ASCII is a bad default file name encoding
Use surrogateescape as the default error handler
Python 3 raises Unicode errors in the C locale
LC_CTYPE=C: pydoc leaves the terminal in an unusable state

If you search around, you will find plenty of complaints. Look at how many people fail to install packages with pip because of some character in a changelog, or because of their home folder, or because their SSH session is ASCII, or because they connected using PuTTY.
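
The failure mode behind many of these reports can be reproduced without touching the environment. The sketch below assumes that a text stream in an ASCII-only session behaves like an io.TextIOWrapper with encoding='ascii' and strict error handling. That stand-in is an assumption for illustration, not a claim about what any particular Python version does at startup, and the changelog line is invented.

```python
import io


def try_write(text, encoding):
    """Write text to an in-memory text stream; report whether it worked."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding, errors="strict")
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return False
    return True


changelog_line = "Fixed by Andr\u00e9"   # one non-ASCII character is enough

print(try_write(changelog_line, "utf-8"))
print(try_write(changelog_line, "ascii"))
```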

Python3 Cat

Now let's start fixing cat for Python 3. How do we do that? First, we have established that we need to deal with bytes, because someone might pass in something that is not in the encoding the shell claims. So at the very least the file contents need to be bytes. But we also need to open standard output in a way that supports bytes, which it does not do by default. And we need to handle separately the case where the Unicode APIs fail because the encoding is C. So here it is, a feature-compatible cat for Python 3:

    import sys
    import shutil

    def _is_binary_reader(stream, default=False):
        try:
            return isinstance(stream.read(0), bytes)
        except Exception:
            return default

    def _is_binary_writer(stream, default=False):
        try:
            stream.write(b'')
        except Exception:
            try:
                stream.write('')
                return False
            except Exception:
                pass
            return default
        return True

    def get_binary_stdin():
        # sys.stdin might or might not be binary in some extra cases.  By
        # default it's obviously non binary which is the core of the
        # problem and the docs recommend changing it to binary for such
        # cases so we need to deal with it.  Also someone might put
        # StringIO there for testing.
        is_binary = _is_binary_reader(sys.stdin, False)
        if is_binary:
            return sys.stdin
        buf = getattr(sys.stdin, 'buffer', None)
        if buf is not None and _is_binary_reader(buf, True):
            return buf
        raise RuntimeError('Did not manage to get binary stdin')

    def get_binary_stdout():
        if _is_binary_writer(sys.stdout, False):
            return sys.stdout
        buf = getattr(sys.stdout, 'buffer', None)
        if buf is not None and _is_binary_writer(buf, True):
            return buf
        raise RuntimeError('Did not manage to get binary stdout')

    def filename_to_ui(value):
        # The bytes branch is unnecessary for *this* script but otherwise
        # necessary as Python 3 still supports addressing files by bytes
        # through separate APIs.
        if isinstance(value, bytes):
            value = value.decode(sys.getfilesystemencoding(), 'replace')
        else:
            value = value.encode('utf-8', 'surrogateescape') \
                .decode('utf-8', 'replace')
        return value

    binary_stdout = get_binary_stdout()
    for filename in sys.argv[1:]:
        if filename != '-':
            try:
                f = open(filename, 'rb')
            except IOError as err:
                print('cat.py: %s: %s' % (filename_to_ui(filename), err),
                      file=sys.stderr)
                continue
        else:
            f = get_binary_stdin()

        with f:
            shutil.copyfileobj(f, binary_stdout)

And this is not even the worst version. It is not that I want to make things extra complicated. It simply is this complicated now. For instance, one thing this example does not do is forcibly flush the text stdout before fetching the binary one. Here it is not necessary because the print call goes to stderr instead of stdout, but if you wanted to print something to stdout, you would have to flush. Why? Because stdout is a buffer on top of another buffer, and if you do not flush it explicitly, your output may appear in the wrong order.
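
A minimal sketch of that flush, using an in-memory stream in place of the real sys.stdout so it can be run anywhere. The function name write_text_then_bytes is hypothetical.

```python
import io


def write_text_then_bytes(text_stream, text, payload):
    """Emit text, then raw bytes, through the same underlying stream."""
    text_stream.write(text)
    # Push pending text into the byte layer before writing bytes directly,
    # otherwise the bytes could overtake the text.
    text_stream.flush()
    text_stream.buffer.write(payload)
    text_stream.buffer.flush()


sink = io.BytesIO()
fake_stdout = io.TextIOWrapper(sink, encoding="utf-8", newline="\n")
write_text_then_bytes(fake_stdout, "header\n", b"\x00\x01")
print(sink.getvalue())
```

With the real sys.stdout the call would look the same, since the text stream exposes its byte layer as the buffer attribute.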

It is not just me. Twisted's compat module, for example, runs into the same trouble.

Dancing the Encoding Dance

To understand the life of a file name argument passed from the shell, this is what now happens in the worst case on Python 3:

The shell passes the file name as bytes to the script.
Python decodes the bytes using the expected encoding before they ever reach your code. Because this is a lossy process, Python 3 applies a special error handler that encodes decoding errors as surrogates in the string.
The Python code then hits a file-not-found error and needs to format an error message. Because we are writing to a text stream, we cannot write the surrogates out, as they are not valid Unicode. Instead:
We encode the Unicode string containing the surrogates as UTF-8 and tell it to handle the surrogate escapes.
Then we decode from UTF-8 and tell it to ignore errors (the code above uses the 'replace' handler).
The resulting string goes back out to the text-only stream.
The terminal then decodes our string for display.
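
The steps above can be traced in a few lines. This sketch assumes a UTF-8 locale and uses an invented latin1 file name. The decode in step 2 imitates, with the surrogateescape handler, what Python 3 does to command-line arguments on POSIX.

```python
raw = b"caf\xe9.txt"                      # step 1: bytes from the shell

# Step 2: decoding with surrogateescape smuggles the bad byte 0xE9
# into the string as the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Step 3: a lone surrogate cannot be encoded for a strict text stream.
try:
    name.encode("utf-8")
    writable = True
except UnicodeEncodeError:
    writable = False

# Steps 4 and 5: turn the surrogate back into the original byte, then
# decode again, replacing what is not valid UTF-8.
shown = name.encode("utf-8", "surrogateescape").decode("utf-8", "replace")

print(repr(name), writable, repr(shown))
```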

Here is what happens on Python 2:

The shell passes the file name as bytes to the script.
The shell decodes our string for display.

Because no string processing happens anywhere along the way, the Python 2 version is just as correct, if not more so, since the shell can then do a better job of displaying the file name.

Note that this does not make the script less correct. If you need to do actual string processing on the input data, you switch to Unicode handling on both 2.x and 3.x. But in that case you also want your script to support a --charset parameter, so the work is about the same on 2.x and 3.x. It is just slightly worse on 3.x, because you need to construct the binary stdout, which is unnecessary on 2.x.
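
Such a parameter could look roughly like this with argparse. The option name follows the text; the default of utf-8, the positional files argument, and the helper names are assumptions made for the sketch.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="cat.py")
    # Hypothetical option: the encoding to assume for input data.
    parser.add_argument("--charset", default="utf-8")
    parser.add_argument("files", nargs="*")
    return parser


def decode_data(data, charset):
    """Bytes in, text out; only needed when real string processing happens."""
    return data.decode(charset)


args = build_parser().parse_args(["--charset", "latin-1", "notes.txt"])
text = decode_data(b"caf\xe9", args.charset)
print(args.files, repr(text))
```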

But You're Wrong

Clearly I am wrong. This is what I have been told:

I only feel the pain because I do not think like a beginner, and the new Unicode system is much friendlier to beginners.
I do not consider Windows users and how much of an improvement the new text model is for them.
The problem is not Python, the problem is the POSIX specification.
Linux distributions need to start supporting C.UTF-8, because they are stuck in the past.
The problem is that SSH passes the wrong encoding. SSH needs to fix this.
The real problem behind many Python 3 Unicode errors is that people do not pass explicit encodings and instead assume that Python 3 will make the right choice.
I work with boundary code, which is obviously harder on Python 3.
I should be improving Python 3 instead of complaining on Twitter and my blog.
You are making problems where there are none. Just let everyone fix their environment and encode everything properly. It is the user's problem.
Java has had this problem for years, and it has been no problem for developers.

You know what? I did stop complaining while I was working on HTTP, because I accepted the idea that many HTTP/WSGI problems are not something ordinary people run into. But you know what? The same problem shows up in a simple Hello World scenario. Maybe I should give up on high-quality Unicode support in my libraries and leave it at that.

I can refute each of the points above, but in the end it does not matter. If Python 3 were the only Python I used, I would work around all these problems and develop with it. But there is another perfectly good language called Python 2, which has the larger user base, and that user base is staying put. At the moment it is just very frustrating.

Python 3 might be strong enough to start pushing UNIX down the path Windows took, with Unicode in many places, but I doubt it.

More likely, people will stick with Python 2 or build broken things on Python 3. Or they will use Go, which uses a model similar to Python 2: everything is a byte string, and its encoding is assumed to be UTF-8. End of story.



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
