Everything you did not want to know about Unicode in Python 3

Source: Internet
Author: User
My readers know me as someone who likes to complain about Unicode in Python 3, and this time is no exception. I will tell you how painful working with Unicode is and why I cannot shut up about it. I spent two weeks working with Python 3 and need to vent my disappointment. There is still useful information in this rant, because it teaches us how to deal with Python 3. If you are not bothered by me, read on.

The content of this rant will be different this time. It has nothing to do with WSGI, HTTP, or anything related. Usually I am told that I should stop complaining about the Python 3 Unicode system because I do not write the kind of code other people typically write (HTTP libraries and the like), so this time I am going to write something else: a command line application. I wrote a handy library called click to make this easier.

Note that I am doing what every novice Python programmer does: writing a command line application, the Hello World of programs. Unlike in the past, though, I want the application to be stable, to support Unicode on both Python 2 and Python 3, and to be unit testable. So the next step is working out how to implement it.

What we want to do

In Python 3, you as a developer must use Unicode. As I understand it, this means that all text data is Unicode and all non-text data is bytes. In such a wonderful world everything is black and white, and the Hello World example is very straightforward. So let's write some shell tools.

Here is the application, a simple cat, in its Python 2 form:

    import sys
    import shutil

    for filename in sys.argv[1:]:
      f = sys.stdin
      if filename != '-':
        try:
          f = open(filename, 'rb')
        except IOError as err:
          print >> sys.stderr, 'cat.py: %s: %s' % (filename, err)
          continue
      with f:
        shutil.copyfileobj(f, sys.stdout)

Obviously this tool is not very good at handling command line options, but it roughly works. So that is what we start with.

UNICODE in UNIX

The above code works in Python 2 because you are implicitly dealing with bytes. The command line arguments are bytes, the file names are bytes, and the file contents are bytes too. Language purists will point out that this is incorrect and will cause problems, but if you think about it more, you will realize that this is an unfixable problem.

UNIX is bytes. It has been defined that way and will always be that way. To understand why, you need to look at the different contexts in which data is passed around:

  • Terminal
  • Command line parameters
  • Operating system input/output layer
  • File system driver

These are not the only places data passes through, by the way, but let's look at in how many of these situations we actually know the encoding. The answer is: none of them. The least we need to understand is that the encoding for terminal output comes from the locale information. That information can be used for display conversions and to find out what encoding text information is in.

For example, if LC_CTYPE is set to en_US.utf-8, it tells applications that the system is running US English and that most text data is UTF-8 encoded. There are actually many other variables, but let's assume this is the only one we need to look at. Note that LC_CTYPE does not mean that all data is UTF-8 encoded. Rather, it tells applications how to classify text characters and which conversions to apply.
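As a rough illustration of how an application might work out which locale applies to character handling, the sketch below follows the POSIX precedence order (LC_ALL overrides LC_CTYPE, which overrides LANG). The helper name effective_ctype is made up for this example, and it only inspects a dictionary of environment variables rather than calling into the C library.

```python
import os

def effective_ctype(env):
    """Return (locale_name, codeset) for character handling.

    Hypothetical helper: follows the POSIX precedence
    LC_ALL > LC_CTYPE > LANG and falls back to the C locale.
    """
    for name in ('LC_ALL', 'LC_CTYPE', 'LANG'):
        value = env.get(name)
        if value:  # unset and empty both count as "not set"
            break
    else:
        value = 'C'
    # A locale name looks like language_TERRITORY.codeset@modifier
    base = value.split('@', 1)[0]
    codeset = base.split('.', 1)[1] if '.' in base else None
    return value, codeset

if __name__ == '__main__':
    print(effective_ctype(os.environ))
```

With only LC_CTYPE=en_US.utf-8 set, this reports the codeset utf-8. With nothing set at all, it reports the C locale and no codeset, which is the situation the rest of this post keeps running into.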

This is important because of the C locale. The C locale is the only locale that POSIX actually specifies. It says that everything is ASCII and that responses from command line tools are handled as defined by the POSIX spec.

In our cat tool above, there is no way to treat the data other than as bytes. The reason is that the shell does not specify what the data is. For example, if you call cat hello.txt, the terminal encodes hello.txt in its own encoding when it passes it to your application.

But now look at this example: echo *. The shell passes all the file names in the current directory to your application. What encoding are they in? File names have no encoding!
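Python 3 still exposes this byte-level view: passing a bytes path to os.listdir returns the names as bytes, exactly as the file system stores them. A small sketch, using a throwaway temporary directory and a made-up helper name:

```python
import os
import tempfile

def list_names_as_bytes(directory):
    """List a directory without decoding the file names."""
    # A bytes argument makes os.listdir return bytes entries.
    return sorted(os.listdir(os.fsencode(directory)))

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    with open(os.path.join(tmp, 'hello.txt'), 'wb'):
        pass
    names = list_names_as_bytes(tmp)
    print(names)
```

Nothing in this listing says which encoding the names are in. Any decoding is a guess made later by whoever wants to display them.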

UNICODE madness

Now a Windows person will probably look at this and wonder what the UNIX people are doing. But it is not that dire. The reason it all works is that some clever people designed the system to be backward compatible. Unlike on Windows, where each API is defined twice, on POSIX the best approach is to assume everything is bytes and to decode it with the default encoding only for display purposes.

Take the cat command above as an example. Suppose it prints an error message about a file it could not open, because the file does not exist, is protected, or for any other reason. Let's assume the file name is encoded in latin1 because it came from an external drive from 1995. The terminal receives our standard output and tries to decode it as UTF-8, because that is the encoding it assumes. Because the string is latin1 encoded, it will not decode cleanly. But do not fear, nothing crashes, because your terminal simply ignores what it cannot handle.

What does it look like in graphical interfaces? They keep two versions of each name. Consider listing files in a graphical file manager such as Nautilus. It associates the file name with the icon so that you can double-click it, and it separately tries to make the file name displayable by decoding it. For example, it tries to decode it as UTF-8 and replaces errors with question marks. Your file name may not be fully readable, but you can still open the file.
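That approach can be sketched in a few lines: keep the raw bytes for opening the file and derive a separate, possibly lossy, string for display. The function name display_name and the UTF-8 default are assumptions for illustration.

```python
def display_name(raw_name, encoding='utf-8'):
    """Decode a file name for display only; undecodable bytes become U+FFFD."""
    return raw_name.decode(encoding, 'replace')

raw = b'caf\xe9.txt'        # latin1-encoded name on a UTF-8 system
label = display_name(raw)   # what the user sees
# `raw` is still what gets passed to open(); `label` is never used for I/O.
print(ascii(label))
```

The label loses information (the é becomes a replacement character), but because the original bytes are kept next to it, the file stays reachable.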

Unicode on UNIX is only madness if you force it onto everything. But that is not how Unicode works on UNIX. UNIX makes no distinction between Unicode and byte APIs. They are one and the same, which makes things easier to handle.

C Locale

The C locale comes up a lot here. It is the escape hatch the POSIX specification uses to force everyone to behave the same way. A POSIX-compliant operating system must support setting LC_CTYPE to C, which makes everything ASCII.

This locale gets picked in a number of situations. You mainly find it in empty environments: programs launched from cron, your init programs, and subprocesses started with an empty environment. The C locale restores a sane ASCII land in environments where you otherwise could not trust anything.

However, the word ASCII implies a 7-bit encoding. This is not a problem, because the operating system deals in bytes! Anything that is 8-bit based is processed just fine. But if you follow the conventions of the operating system, character processing is limited to the first 7 bits. Any message your tool generates is encoded in ASCII and written in English.

Note that the POSIX specification does not say that your application should die in flames.
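To make the 7-bit limitation concrete: bytes with the high bit set pass through untouched, but text encoded as ASCII cannot represent them. The sketch below shows one way a tool could keep emitting output under an ASCII-only locale instead of failing. Using backslashreplace is my choice for the example, not something POSIX prescribes, and to_ascii_output is a hypothetical helper.

```python
def to_ascii_output(text):
    """Encode text for an ASCII-only stream without raising."""
    return text.encode('ascii', 'backslashreplace')

message = 'caf\xe9.txt: No such file'

# A strict ASCII encode fails on U+00E9.
try:
    message.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The lenient version escapes the character and keeps going.
safe = to_ascii_output(message)
print(strict_ok, safe)
```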

Python 3 dies in flames

Python 3 takes a very different stance on Unicode than UNIX does. Python 3 says: everything is Unicode (by default, except in certain situations, and except when we are sent badly re-encoded data, and even then it is sometimes still Unicode, just wrong Unicode). File names are Unicode, terminals are Unicode, stdin and stdout are Unicode, there is so much Unicode. And because UNIX is not Unicode, Python 3 now takes the stance that it is right and UNIX is wrong, and that people should change the POSIX definition to add Unicode. Then file names would be Unicode, terminals would be Unicode, and the errors caused by bytes would never be seen again.

It is not just me. These are bugs caused by Python's brainless idea of Unicode:

  • ASCII is a bad filesystem default encoding
  • Use surrogateescape as the default error handler
  • Python 3 raises Unicode errors in the C locale
  • LC_CTYPE=C: pydoc leaves the terminal in an unusable state

If you search for it, you will find plenty more complaints. Look at how many people failed to install packages with pip because of some characters in a changelog, because their home folder has non-ASCII characters in it, because their SSH session uses ASCII, or because they connected using PuTTY.

Python 3 cat

Now let's fix cat for Python 3. What do we need to do? First, we need to deal with bytes, because someone might give us something that does not match the shell's encoding. So the file content needs to be bytes in any case. But we also need to open the standard output in a way that supports bytes, which it does not do by default. We also need to handle separately the case where the Unicode APIs fail because the encoding is C. So here it is, a feature-complete cat for Python 3:

    import sys
    import shutil

    def _is_binary_reader(stream, default=False):
      try:
        return isinstance(stream.read(0), bytes)
      except Exception:
        return default

    def _is_binary_writer(stream, default=False):
      try:
        stream.write(b'')
      except Exception:
        try:
          stream.write('')
          return False
        except Exception:
          pass
        return default
      return True

    def get_binary_stdin():
      # sys.stdin might or might not be binary in some extra cases. By
      # default it's obviously non binary which is the core of the
      # problem but the docs recommend changing it to binary for such
      # cases so we need to deal with it. Also someone might put
      # StringIO there for testing.
      is_binary = _is_binary_reader(sys.stdin, False)
      if is_binary:
        return sys.stdin
      buf = getattr(sys.stdin, 'buffer', None)
      if buf is not None and _is_binary_reader(buf, True):
        return buf
      raise RuntimeError('Did not manage to get binary stdin')

    def get_binary_stdout():
      if _is_binary_writer(sys.stdout, False):
        return sys.stdout
      buf = getattr(sys.stdout, 'buffer', None)
      if buf is not None and _is_binary_writer(buf, True):
        return buf
      raise RuntimeError('Did not manage to get binary stdout')

    def filename_to_ui(value):
      # The bytes branch is unnecessary for *this* script but otherwise
      # necessary as python 3 still supports addressing files by bytes
      # through separate APIs.
      if isinstance(value, bytes):
        value = value.decode(sys.getfilesystemencoding(), 'replace')
      else:
        value = value.encode('utf-8', 'surrogateescape') \
          .decode('utf-8', 'replace')
      return value

    binary_stdout = get_binary_stdout()
    for filename in sys.argv[1:]:
      if filename != '-':
        try:
          f = open(filename, 'rb')
        except IOError as err:
          print('cat.py: %s: %s' % (
            filename_to_ui(filename),
            err
          ), file=sys.stderr)
          continue
      else:
        f = get_binary_stdin()

      with f:
        shutil.copyfileobj(f, binary_stdout)

And this is not even the worst version. It is not that I want to make things more complicated; things simply are this complicated now. For example, what this example does not do is forcibly flush the text stdout before fetching the binary one. Here that is unnecessary because the print call goes to stderr instead of stdout, but if you wanted to print something to stdout, you would have to flush it. Why? Because stdout is a buffer on top of another buffer, and if you do not flush it forcibly, your output may come out in the wrong order.

It is not just me, either. Look at twisted's compat module, for example, and you will find the same trouble.
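The ordering problem can be reproduced without touching the real stdout. The sketch below stacks a text wrapper on a buffered binary writer, the same layering that sys.stdout and sys.stdout.buffer have, and writes to both layers. The helper name write_mixed is invented for the example. On CPython the unflushed variant typically comes out in the wrong order, but that depends on buffering details, so only the flushed variant is something to rely on.

```python
import io

def write_mixed(flush_first):
    """Write text via the text layer, then bytes via the binary layer."""
    raw = io.BytesIO()
    binary = io.BufferedWriter(raw)
    text = io.TextIOWrapper(binary, encoding='utf-8')

    text.write('first\n')        # may sit in the text layer's own buffer
    if flush_first:
        text.flush()             # push it down before going binary
    binary.write(b'second\n')    # written directly to the lower layer
    text.flush()                 # flush everything that is still pending
    return raw.getvalue()

print(write_mixed(flush_first=True))
print(write_mixed(flush_first=False))
```

In a real script the equivalent step is calling sys.stdout.flush() before writing to sys.stdout.buffer.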

The encoding dance

To understand what happens to a file name passed on the command line by the shell, this is the worst case on Python 3:

  1. The shell passes the file name as bytes to the script.
  2. Before they hit your code, the bytes are decoded by Python using the encoding it expects. Because this is a lossy process, Python 3 uses a special error handler (surrogateescape) that stores undecodable bytes in the string as surrogates.
  3. The Python code then runs into an error for the file and needs to format an error message. Because we are writing to a text stream, we cannot write the surrogates out, as they are not valid Unicode.
  4. So we encode the Unicode string containing the surrogates to UTF-8 and tell it to handle the surrogate escapes.
  5. Then we decode it from UTF-8 again and tell it to replace whatever cannot be decoded.
  6. The resulting string goes back out to a text-only stream.
  7. Then the terminal decodes our string for display.
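Steps 2 to 5 can be replayed on a single byte string. This is only a sketch: it assumes the expected encoding is UTF-8 and uses a latin1-encoded name as a stand-in for a file name that does not match it.

```python
raw = b'caf\xe9.txt'   # what the shell hands over (latin1 bytes)

# Step 2: Python decodes argv; undecodable bytes become lone surrogates.
name = raw.decode('utf-8', 'surrogateescape')

# Step 3: a strict encode, as a text stream would do, rejects the surrogate.
try:
    name.encode('utf-8')
    writable = True
except UnicodeEncodeError:
    writable = False

# Step 4: encode back to bytes, undoing the surrogate escapes.
round_trip = name.encode('utf-8', 'surrogateescape')

# Step 5: decode again, this time replacing what cannot be decoded.
printable = round_trip.decode('utf-8', 'replace')

print(ascii(name), writable, round_trip, ascii(printable))
```

The bytes survive the round trip unchanged, which is why the file can still be opened. Only the final display string is lossy.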

On Python 2:

  1. The shell passes the file name as bytes to the script.
  2. The shell decodes our string for display.

Because no string handling happens anywhere in the Python 2 version, it is just as correct, if not more so, because the shell can then do a better job of displaying the file name.

Note that this does not make the script less correct. If you needed to do actual string processing on the input data, you would switch to Unicode handling in both 2.x and 3.x. But in that case you would also want your script to support a --charset parameter, so the work on 2.x and 3.x would be similar. It is just worse on 3.x, where you need to construct the binary stdout, which is unnecessary on 2.x.
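For a tool that does process text, the --charset option mentioned above might look like the following sketch. The option name comes from the text. The parser layout, the UTF-8 default, and the decode_input helper are assumptions made for illustration.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog='cat.py')
    parser.add_argument('--charset', default='utf-8',
                        help='encoding used to decode the input data')
    parser.add_argument('files', nargs='*', default=['-'])
    return parser

def decode_input(data, charset):
    """Turn raw input bytes into text using the user-supplied charset."""
    return data.decode(charset, 'replace')

# Simulate: cat.py --charset latin1 hello.txt
args = build_parser().parse_args(['--charset', 'latin1', 'hello.txt'])
text = decode_input(b'caf\xe9', args.charset)
print(ascii(text))
```

The point is that the encoding becomes an explicit input from the user rather than something guessed from the environment, and that part looks the same on 2.x and 3.x.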

But you are wrong.

Obviously, I am wrong. I was told this:

  • I find it painful only because I do not think like a beginner, and the new Unicode system is much friendlier to beginners.
  • I do not consider Windows users and how much the new text model improves things for them.
  • The problem is not Python, it is the POSIX specification.
  • Linux distributions need to start supporting C.UTF-8, because they are stuck.
  • The problem is SSH, which passes the wrong encoding along. SSH needs to be fixed.
  • The real cause of many Unicode errors in Python 3 is that people do not pass explicit encodings and instead assume that Python 3 makes the right decision.
  • I work with broken code, and obviously that is harder on Python 3.
  • I should help improve Python 3 instead of complaining on Twitter and blogs.
  • I am making up problems where there are none. Just let everybody fix their environments and encodings everywhere. It is a user problem.
  • Java has had this problem for years, and it was no problem for developers.

You know what? I stopped complaining while I was doing HTTP work, because I accepted the idea that many of the HTTP/WSGI problems are not something most people have to deal with. But you know what else? The same problem exists in something as simple as Hello World. Maybe I should give up on trying to build libraries with high-quality Unicode support.

I could refute the points above, but in the end it does not matter. If Python 3 were the only Python language I used, I would work around all the problems and develop with it. But there is another perfectly good language called Python 2. It has the larger, solid user base. At the moment this is just very frustrating.

Python 3 may be big enough to start pushing UNIX down the Windows route, with Unicode enforced in many places, but I doubt it.

The more likely outcome is that people stick with Python 2 or build broken things on Python 3. Or they will use Go, which uses a model similar to Python 2: everything is a byte string, and its encoding is assumed to be UTF-8. End of story.
