Python's encoding problems have confused programmers for years; this blog post finally makes them clear.
Original address: http://nedbatchelder.com/text/unipain.html
Address: http://pycoders-weekly-chinese.readthedocs.org/en/latest/issue5/unipain.html
Translator: yudun1989
Practical Unicode Programming Guide
This is the talk I gave at PyCon 2012. You can read the slides and text on this page, open the presentation directly in your browser, or watch the video.
Clicking on an image in the article will take you to the corresponding place in the slides. The images use the Symbola font, so if you want the special symbols to display correctly, you need to download that font first.
Hello, I'm Ned Batchelder. I've been programming in Python for over a decade, and in that time, like many other programmers, I've made many Unicode mistakes.
If you're like most Python programmers, you've certainly run into something like this: you wrote a lovely piece of code, and everything seemed to be going smoothly. Then one day an accented character appeared out of nowhere, and your program started spewing UnicodeErrors.
You sort of know how this kind of problem is supposed to be solved, so you add an encode or a decode where the error occurred, but then the UnicodeError pops up somewhere else. So you add a decode or an encode in that second place. After enough rounds of this game of Unicode whack-a-mole, the problem seems to be fixed.
Then a few days later, another accented character appears somewhere else, and you have to play more whack-a-mole until the problem goes away.
Now your program finally runs. But you're annoyed and uncomfortable: the problem took far too much time, you suspect the solution isn't really "right", and you hate yourself a little. Your main takeaway about Unicode is that you hate it.
You don't want to learn about weird character sets; you just want to write a program that doesn't make you feel bad.
You don't have to play whack-a-mole. Unicode can be a hassle, but it isn't hard. With a bit of knowledge and some practice, you too can solve these problems easily and elegantly.
Next I'll teach you five Facts of Life, and then give you some professional advice for avoiding Unicode pain. The material covers the basics of Unicode, and how to handle it in both Python 2 and Python 3. They differ in the details, but the basic strategy is the same for both.
The World and Unicode
We start with the basic knowledge of Unicode.
Fact one: everything in a computer is bytes. Files on disk are a series of bytes, and only bytes travel over the network. Almost all the information going into or out of any program you write is made of bytes.
The isolated byte is meaningless, so let's give them meaning.
To represent text, we have used ASCII for nearly 50 years. ASCII assigns a meaning to 95 byte values, covering 95 printable symbols, so when I send you the byte value 65, you know I mean a capital A.
ISO Latin 1, also known as 8859-1, extends ASCII with 96 more characters. That's about the most you can do with a single byte, because a byte has no room to store more symbols.
On Windows, 27 additional characters were added. The result is called the CP1252 encoding.
Fact two: the world has far more than 256 characters. A single byte cannot possibly represent all the world's text. During your rounds of whack-a-mole, you may have wished everyone in the world spoke English, but they don't: people need more symbols to communicate.
Facts one and two together create a conflict between the structure of computer hardware and the needs of the world's people.
There have been many attempts to resolve the conflict. One-byte encodings that assign a character to each byte value solve fact one, but none of them solves fact two.
There were many such single-byte encodings, and none solved the problem: each covers only a part of human language, and none can handle all of it.
People then created two-byte character sets, but these were still fragmented, each serving only a subset of people in some region.
Multiple standards were in effect at the same time and, ironically, together they still weren't enough to cover all the symbols in use.
Unicode was designed to solve the problems of those older character sets. Unicode assigns an integer, called a code point, to each character; code points are written as U+ followed by four to six hexadecimal digits. Unicode has room for 1.1 million code points, of which about 110,000 are currently assigned, so there is plenty of space for future growth.
Unicode's goal is to include everything. It starts with ASCII, adds thousands of symbols, including the famous Snowman ☃, covers all the world's writing systems, and is still being extended. For example, the latest update added a number of frivolous symbols.
Here are six exotic Unicode characters. Unicode code points are written as four, five, or six hex digits with a U+ prefix. Each character also has an official name, which is spelled entirely in ASCII.
So Unicode has room for all the characters we'll ever need. But we still face fact one: computers only deal in bytes. We need a way to represent Unicode code points as bytes so they can be stored and transmitted.
The Unicode standard defines several ways to represent code points as bytes. These are called encodings.
UTF-8 is easily the most popular encoding for storing and transmitting Unicode. It uses a variable number of bytes for each code point. ASCII characters need only one byte each, with the same values as in ASCII, so ASCII is a subset of UTF-8.
Here are our strange characters shown in their UTF-8 representation. The ASCII characters H and I take only one byte each. The others take two or three bytes, depending on their code points. Some code points, though less commonly used, take as many as four bytes.
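To make the variable-width idea concrete, here is a small sketch in Python 3 syntax (where a plain string holds code points). The sample characters are my own picks to show each byte width:

```python
# UTF-8 uses 1 to 4 bytes per code point, depending on its value.
for ch in ["H", "\u00e9", "\u30c4", "\U0001f631"]:
    encoded = ch.encode("utf-8")
    print(ch, "->", len(encoded), "byte(s):", encoded)
```

H (ASCII) needs one byte, é two, ツ three, and the emoji four.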
Python 2
OK, that's enough theory. Let's talk about Python 2.
In Python 2, there are two types of string data. Plain old strings, "str" objects, store bytes. If you use a "u" prefix, you get a "unicode" object, which stores code points. In a unicode string, you can insert any code point with a \u escape sequence.
You may have noticed that the word "string" is a problem here. Both "str" and "unicode" are kinds of strings, and it's tempting to call either of them just "a string", but it's better to use precise terms to keep them apart.
To convert between unicode and bytes, each type has a method. Unicode strings have an .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. Each takes an argument naming the encoding to use.
We can define a unicode string called my_unicode, which has nine characters, and use its encode method to create a byte string: 19 bytes, just as you'd expect. Decoding the byte string with UTF-8 gives back the unicode string.
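Here is that round trip as runnable code, written in Python 3 syntax (where a plain string is the unicode type); the nine-character string and its 19-byte UTF-8 form follow the talk's slides:

```python
# Nine code points, several of them non-ASCII.
my_unicode = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
my_utf8 = my_unicode.encode("utf-8")          # unicode -> bytes

print(len(my_unicode))                        # 9 characters
print(len(my_utf8))                           # 19 bytes
print(my_utf8.decode("utf-8") == my_unicode)  # True: decoding round-trips
```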
Unfortunately, encode and decode raise errors if they can't handle the data. If we try to encode our exotic characters to ASCII, it fails, because ASCII can only represent characters in the range 0-127, and our unicode string has code points outside that range.
The exception raised is UnicodeEncodeError. It shows the encoding you were using ("codec" is short for coder/decoder) and the position of the character that caused the problem.
Decoding can also produce errors. If we decode our UTF-8 byte string as ASCII, we get a UnicodeDecodeError, for the same reason: ASCII only accepts values up to 127, and our UTF-8 byte string contains values outside that range.
Nor can UTF-8 decode just any byte string: if we try to decode some junk bytes, we also get a UnicodeDecodeError. This is actually an advantage of UTF-8. Because it rejects invalid byte strings, it helps us build robust systems: bad data is never silently accepted.
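Both failure modes can be sketched like this (Python 3 syntax; the junk bytes are arbitrary values I chose that do not form a valid UTF-8 sequence):

```python
my_unicode = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

# Encoding fails: ASCII has no representation for these code points.
try:
    my_unicode.encode("ascii")
except UnicodeEncodeError as err:
    print("encode failed:", err)

# Decoding fails: these bytes are not a valid UTF-8 sequence.
try:
    b"\x78\x9a\xbc\xde\xf0".decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err)
```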
When encoding or decoding, you can specify what should happen when the codec can't handle the data. An optional second argument to encode or decode names the error-handling policy. The default value is "strict", which means raise an exception, as we just saw.
The "replace" value means: substitute a standard replacement character. When encoding, the replacement is a question mark, so every code point that can't be encoded produces a "?".
Other error handlers are more useful. "xmlcharrefreplace" produces an HTML/XML character entity reference, so \u01b4 becomes "&#436;" (because hexadecimal 01B4 is 436 in decimal). This is very useful if the value is going to be written into an HTML file.
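The two encode-time handlers side by side, in Python 3 syntax (the decimal values in the entity references match the hex code points, e.g. 0x2119 = 8473):

```python
s = "Hi \u2119\u01b4\u2602"

# "replace" loses information: every unencodable character becomes "?".
print(s.encode("ascii", "replace"))            # b'Hi ???'

# "xmlcharrefreplace" keeps it as numeric character references.
print(s.encode("ascii", "xmlcharrefreplace"))  # b'Hi &#8473;&#436;&#9730;'
```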
Note that different error handlers suit different reasons for the error. "replace" is a defensive way to handle data you can't interpret, and it loses information. "xmlcharrefreplace" preserves all the original information, for use when XML escapes are acceptable in the output.
You can also specify error handling when decoding. "ignore" simply drops bytes that can't be decoded. "replace" substitutes the Unicode replacement character, U+FFFD, for each problem byte. Note that because the decoder can't interpret the data, it doesn't know how many Unicode characters were intended: decoding our UTF-8 string as ASCII produces 16 replacement characters, one for each undecodable byte, even though those bytes were meant to represent only six Unicode characters.
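The decode-time handlers, sketched in Python 3 syntax on the same 19-byte UTF-8 string:

```python
my_utf8 = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24".encode("utf-8")

# "ignore" silently drops the 16 bytes that ASCII cannot decode.
print(my_utf8.decode("ascii", "ignore"))   # 'Hi '

# "replace" substitutes U+FFFD for each bad byte: 16 replacement
# characters for bytes that encoded only 6 real characters.
replaced = my_utf8.decode("ascii", "replace")
print(replaced.count("\ufffd"))            # 16
```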
Python 2 tries to be helpful when working with unicode and byte strings. If you try to combine a byte string with a unicode string, Python 2 automatically decodes the byte string to unicode, producing a new unicode string.
For example, we concatenate the unicode string "Hello " and the byte string "world". The result is a unicode "Hello world". Behind the scenes, Python 2 decodes "world" using the ASCII codec. The encoding used for these implicit decodings is the value of sys.getdefaultencoding().
The implicit encoding is ASCII because it's the only safe guess: ASCII is so widely accepted, and is a subset of so many encodings, that it's unlikely to be wrong.
Of course, these implicit decodings are not immune to decoding errors. If you try to combine a byte string with a unicode string and the byte string can't be decoded as ASCII, a UnicodeDecodeError is raised.
This is the source of those dreaded UnicodeErrors. Your code inadvertently mixes unicode strings and byte strings, and as long as the data is all ASCII, every conversion succeeds. Once a non-ASCII character sneaks into your program, an implicit decoding fails, causing a UnicodeDecodeError.
Python 2's philosophy is that unicode strings and byte strings are confusing, and it tries to ease your burden with automatic conversions, just as it converts between ints and floats. But the int-to-float conversion can't fail, while the byte-string-to-unicode conversion can.
Python 2 silently papers over the byte-to-unicode conversion, which makes it much easier to work with ASCII data. The price you pay is that it fails on non-ASCII data.
There are many ways to combine two strings, and all of them decode bytes to unicode, so you have to watch out for all of them.
First we use an ASCII format string with unicode data to fill it. The result is unicode: a unicode string is returned.
Then we swap the two: a unicode format string combined with a byte string again produces a unicode string, because the byte string can be decoded as ASCII.
Even simply printing a unicode string invokes an implicit encoding: output is always bytes, so a unicode string must be encoded into bytes before it can be printed.
The next one is truly baffling: we ask Python to encode a byte string to UTF-8, and get an error about not being able to decode it as ASCII! The problem here is that byte strings can't be encoded: remember, encoding is how you turn unicode into bytes. So to carry out your request, Python 2 first implicitly decodes your byte string to unicode, using ASCII.
Finally, we encode an ASCII byte string to UTF-8. The same implicit decoding happens, but since the string is ASCII, it succeeds; the result is then encoded as UTF-8, producing the original byte string, because ASCII is a subset of UTF-8.
Fact three: both byte strings and unicode strings matter, and you have to deal with both of them. You can't pretend everything is bytes, or that everything is unicode; you have to use each for its own purpose and convert between them when necessary.
Python 3
We've seen the Unicode pain of Python 2; now let's look at Python 3. The most important change from Python 2 to Python 3 is in how they handle Unicode.
Like Python 2, Python 3 has two string types, one for unicode and one for bytes, but they have different names.
The plain string type, "str", now stores unicode, and the "bytes" type stores byte strings. You can create a byte string with a b prefix.
So "str" in Python 2 is now called "bytes", and "unicode" in Python 2 is now called "str". This makes more sense than the Python 2 names, since unicode is what you usually want to store, and byte strings only appear when you're dealing with raw bytes.
The biggest change in Python 3's Unicode support is that there is no automatic decoding of byte strings. If you try to combine a byte string with a unicode string, you get an error, no matter what the contents are.
All the operations that Python 2 handled implicitly raise errors in Python 3.
In addition, a unicode string and a byte string containing the same ASCII characters compare equal in Python 2 but not in Python 3. One consequence is that a unicode dictionary key can't retrieve a value stored under a byte-string key, and vice versa, though this works in Python 2.
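A minimal illustration of this Python 3 behavior:

```python
# In Python 3, str and bytes never compare equal, even for pure ASCII.
print("hello" == b"hello")   # False

# So a bytes key cannot find a value stored under a str key.
d = {"hello": "world"}
print(b"hello" in d)         # False
```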
This drastically changes the nature of Unicode pain in Python 3. In Python 2, mixing unicode and bytes succeeds as long as the data is all ASCII; in Python 3 it fails immediately, regardless of the data.
So in Python 2 you believe your program is correct, only to discover later that it fails on some exotic character. In Python 3, your program fails right away, so even if all your data is ASCII, you are forced to sort out the relationship between bytes and unicode.
Python 3 is very strict about bytes and unicode: you are forced to deal with these issues up front. This has been controversial.
One of the places this strictness shows up is reading files. Python has two modes for reading files: binary and text. In Python 2, the mode only affected line endings, and on Unix systems it made no difference at all.
In Python 3, the two modes produce different results. When you open a file in text mode, either by passing "r" or by relying on the default, the data read from the file is automatically decoded into Unicode, and you get str objects.
If you open a file in binary mode, by passing "rb", then the data read from the file is bytes, with no processing applied.
The implicit conversion of bytes to unicode uses locale.getpreferredencoding(), and it may not give the result you want. For example, when you read hi_utf8.txt, it is decoded using the encoding from your language preferences; if we created the file on Windows, that is "cp1252". Like ISO 8859-1, CP-1252 can decode any byte value, so it will never raise a UnicodeDecodeError. That also means it will cheerfully decode data that isn't actually CP-1252, producing the junk we don't want.
To read the file correctly, you should specify the encoding you want. The open function now takes an encoding parameter.
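A sketch of reading with an explicit encoding (the file name hi_utf8.txt follows the article; writing it to a temporary directory first is just to make the example self-contained):

```python
import os
import tempfile

# Create a UTF-8 file, then read it back with an explicit encoding
# rather than relying on locale.getpreferredencoding().
path = os.path.join(tempfile.mkdtemp(), "hi_utf8.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Hi \u2119\u01b4\u2602")

with open(path, "r", encoding="utf-8") as f:
    print(f.read())    # text mode: str objects, correctly decoded

with open(path, "rb") as f:
    print(f.read())    # binary mode: raw bytes, no processing
```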
Relieving the Pain
OK, so how do you relieve the pain? The good news is that the rules are simple, and they're the same for Python 2 and Python 3.
As we saw with fact one, only bytes go into and out of your program, but you don't have to deal with bytes everywhere inside it. The best strategy is to decode incoming bytes to unicode immediately, use unicode throughout your program, and encode back to bytes only on output, as late as possible.
Make a Unicode sandwich: bytes on the outside, unicode on the inside.
Keep in mind that sometimes a library will do this for you. Some libraries accept unicode and produce unicode, handling the conversions at the edges on your behalf. Django, for example, provides unicode, as does its JSON module.
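The sandwich can be sketched as a tiny pipeline. Here process is a hypothetical helper, not from the article, and UTF-8 at both edges is an assumption:

```python
def process(raw: bytes) -> bytes:
    """Hypothetical handler: bytes at the edges, unicode in the middle."""
    text = raw.decode("utf-8")      # input edge: decode as early as possible
    shouted = text.upper()          # the filling: pure unicode work
    return shouted.encode("utf-8")  # output edge: encode as late as possible

print(process("h\u00e9llo \u2602".encode("utf-8")))
```

Because all the real work happens on unicode text, the filling of the sandwich never has to worry about encodings at all.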
The second rule is: know what you have. At any point in your program, you must know whether you're holding a byte string or a unicode string. It shouldn't be a matter of guesswork; it should be by design.
In addition, if you have a byte string and you intend to do anything with it, you should know what encoding it is in.
When debugging, you can't just print a value to find out what it is. You need to look at its type, or look at its repr, to see exactly what kind of data you have.
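For example, in Python 3 syntax, type() and repr() make the distinction unmistakable where a plain print() might not:

```python
word = "caf\u00e9"
data = word.encode("utf-8")

# print() alone can make the two look similar; type() and repr() cannot.
print(type(word), repr(word))   # <class 'str'> 'café'
print(type(data), repr(data))   # <class 'bytes'> b'caf\xc3\xa9'
```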
I said you should know the encoding of your byte strings. Well, here is fact four: you cannot determine the encoding of a byte string by inspecting it. You have to know through other means. For example, many protocols declare the encoding; here we have examples from HTTP, HTML, XML, and Python source files. You may also know the encoding by prior arrangement, for instance if the data source's specification states it.
There are ways to guess at the encoding of a pile of bytes, but they are only guesses. The only way to be sure is to know it through some other channel.
Here's an example of guessing wrong. We take the UTF-8 bytes of some exotic characters and decode them with various decoders, printing the results. As you can see, sometimes an incorrect decoder happily decodes the bytes anyway, just into the wrong characters. Your program can't tell you it decoded them wrong; you only find out when a user notices.
This is a good demonstration of fact four: the same stream of bytes is decodable by different decoders. The bytes themselves carry no indication of which encoding they use.
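This can be sketched as follows (latin-1 and cp1252 are simply two single-byte encodings I chose that happen to accept these bytes; only UTF-8 recovers the intended text):

```python
data = "Hi \u2602".encode("utf-8")   # b'Hi \xe2\x98\x82'

for enc in ("utf-8", "latin-1", "cp1252"):
    # Each decode "succeeds", but only utf-8 recovers the umbrella.
    print(enc, "->", data.decode(enc))
```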
By the way, the garbage you get from displaying bytes decoded with the wrong encoding has a name: mojibake.
Unfortunately, byte streams arrive labeled with encodings from many different sources, and sometimes the declared encoding is wrong. For example, you might fetch an HTML page from the web whose HTTP headers declare the encoding to be 8859-1, while the actual content is UTF-8.
Sometimes the mismatch only produces mojibake, and sometimes it raises a UnicodeError.
Lastly: test your Unicode support explicitly. To do this, sprinkle exotic Unicode throughout your test data. If you only speak English, this can be a little difficult, because some Unicode data is hard to read. Fortunately, Unicode's variety means you can find strings that are elaborately complex but still readable.
Here's an example: accented text that is still legible to an English reader, and upside-down text. Young people sometimes paste text like this into social networks.
Depending on your program, you may need to dig deeper into Unicode. There are many details I haven't covered here, and they can get quite involved. I call this fact five and a half, because there's no limit to how deep the details go.
To review, we have five facts that cannot be ignored:
- Input and output of a program are always bytes
- The world's text needs more than 256 symbols to represent
- Your program has to deal with both bytes and Unicode
- A stream of bytes does not carry its own encoding
- Declared encodings can be wrong
Here are three tips for keeping Unicode clean in your programming:
- Unicode sandwich: whenever possible, the text your program handles internally should be Unicode.
- Know your strings. You should know which strings in your program are Unicode and which are bytes, and for the byte strings, you should know what encoding they are in.
- Test your Unicode support. Use some exotic symbols to verify that you've done the above.
If you follow this advice, you will write code with solid Unicode support: no matter how unruly the Unicode flowing through it, your program will not fall over.
Some other resources you may find useful:
Joel Spolsky wrote The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which summarizes how Unicode works and why. It contains no Python, but it's more detailed than what I've explained here!
If you need to deal with the semantics of specific Unicode characters, the unicodedata module may help you.
If you need some Unicode text for testing, the various text manglers on the web can supply it.