Handling text correctly, and Unicode in particular, trips up even seasoned developers. This is not because the problem is inherently difficult, but because many developers misunderstand a few key concepts about text and how it is represented in software. Search StackOverflow for questions about UnicodeDecodeError and you will see how widespread these misunderstandings are. They go back to the introduction of Unicode itself, and many developers who started before that era, myself included, are still working today. The misconceptions would not matter much if they had stayed isolated, but some very popular languages have spread and even entrenched them, which makes them hard to correct.
Based on their Unicode support, programming languages can be divided into four classes:
- Languages designed before Unicode appeared or became widespread. C and C++ fall into this category. Unicode support in such languages is uneven: it is either not built into the language or difficult to use correctly, so developers frequently get it wrong.
- Languages with slightly better Unicode support. These appeared only after Unicode was already widespread, yet the way they manipulate Unicode is seriously flawed. Despite being born later, they inherit all the shortcomings of the first class. In my experience, PHP is representative of this category, although other languages are just as bad.
- Languages whose Unicode support is basically correct but marred by a few fatal flaws. These languages are "modern" and understand Unicode, yet still make it easy for developers to mishandle it, which leads to serious Unicode-related defects. To my dismay, Python 2.x falls into this category (detailed below).
- Languages that handle Unicode correctly. These fully support Unicode and let you complete Unicode tasks quickly, easily, and without error. Java and the .NET platform belong to this class.
So what exactly is Unicode, and what mistakes do we make with it? Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode" is an article every software developer should read. For brevity, and for readers short on patience, I will summarize the essentials here.
Characters and bytes
The basic fact is that to handle text correctly, you must understand the abstract concept of a character. Loosely defined, a character represents a single symbol in a piece of text. More importantly, a character is not a byte. Let me stress that again: a character is not a byte! A character can have many representations, and different representations use different numbers of bytes. As I said earlier, the character is the smallest unit of text.
Unicode defines a set of characters in a way everyone agrees on. You can think of Unicode as a character database in which each character is associated with a unique number, called a code point. For example, the code point of the English capital letter A is U+0041, and the code point of the euro sign (€) is U+20AC. A text string is a sequence of code points, one for each character in the string.
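The relationship between characters and code points can be inspected directly in Python 3, where `ord()` returns a character's code point and `chr()` goes the other way. A minimal sketch:

```python
# ord() maps a character to its Unicode code point; chr() is the inverse.
assert ord('A') == 0x41        # U+0041, LATIN CAPITAL LETTER A
assert ord('€') == 0x20AC      # U+20AC, EURO SIGN
assert chr(0x20AC) == '€'

# A text string is a sequence of code points:
print([hex(ord(c)) for c in 'A€'])  # ['0x41', '0x20ac']
```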
Of course, sooner or later you will need to store and transmit these theoretical Unicode strings. If you can represent them as bytes in a way that others also understand, you can exchange text in a form everyone agrees on. This is where character encodings come in.
A character encoding is a mapping between ideal characters and their actual byte representations. The mapping need not be exhaustive, meaning that some characters may simply not be representable in a given encoding. Nor does every character need to occupy the same amount of space: some characters may be encoded in a single byte, while others require several.
Because the same character can have more than one byte representation, when you encounter a string of bytes without knowing which encoding was used, you cannot know what it means, even if you know the bytes represent text. All you can do is guess. In short, bytes are not text. Even if you forget everything else in this article, remember that sentence. To read and write text, you ultimately need to know the encoding in use, whether from convention, accompanying metadata, or some other means.
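This is easy to make concrete in Python 3: the same character maps to different bytes under different encodings, and decoding with the wrong encoding produces either garbage or an outright error. A sketch:

```python
# The same character, two encodings, two different byte sequences:
assert 'é'.encode('utf-8') == b'\xc3\xa9'
assert 'é'.encode('latin-1') == b'\xe9'

# Decoding the UTF-8 bytes as Latin-1 "succeeds" but produces mojibake:
assert b'\xc3\xa9'.decode('latin-1') == 'Ã©'

# Decoding the Latin-1 bytes as UTF-8 fails outright:
try:
    b'\xe9'.decode('utf-8')
except UnicodeDecodeError:
    print('bytes alone do not tell you the encoding')
```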
How Python handles Unicode
With that background, let us look at Python's Unicode support. Python 2's type hierarchy has three different string types: unicode, which represents a Unicode (text) string; str, which represents a byte string (binary data); and basestring, the common parent class of the other two. In my view, Python made a mistake here, one that by the earlier classification makes it a third-class language rather than a fourth-class one.
I have stressed at length that bytes and characters are fundamentally different, and can only be converted between via a character encoding. Unfortunately, Python 2 makes two distinct mistakes that make it easy to forget this.
The severity of the first mistake is debatable: Python 2 treats a string of bytes as a string of text. Whether this is acceptable is disputed; Java and .NET consider it wrong, while others take the opposite view. In any case, you often want to perform textual operations such as regular-expression matching or string substitution, and applying these to a raw byte sequence is meaningless. Python 2 nevertheless treats a byte sequence as just another kind of string and allows the same operations on both.
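Why text operations on raw bytes are meaningless is easy to demonstrate (shown here in Python 3 syntax, where the two types are clearly separated): operations on bytes only understand the ASCII subset and silently mangle everything else. A sketch:

```python
text = 'straße'              # a text string: six characters
data = text.encode('utf-8')  # its UTF-8 bytes: b'stra\xc3\x9fe', seven bytes

# On text, upper() knows the Unicode case rules (ß uppercases to SS):
assert text.upper() == 'STRASSE'

# On bytes, upper() only touches ASCII letters and mangles the rest:
assert data.upper() == b'STRA\xc3\x9fE'
assert data.upper().decode('utf-8') == 'STRAßE'   # not what anyone wanted
```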
The second mistake is far more serious: Python 2 attempts to convert between byte strings and Unicode strings implicitly, without your being aware of it. In various situations, where conditions permit, Python 2 converts directly between the two, for example when you concatenate a byte string with a Unicode string. As explained above, converting between the two types without an encoding makes no sense, so Python 2 falls back on a "default encoding" specified via sys.setdefaultencoding(). On most platforms this default is ASCII, which is almost always the wrong choice for any given conversion. The default encoding is used whenever you do not specify an encoding explicitly, including when str() or unicode() expects a string of one type but is passed an argument of the other.
One proposed fix for this Unicode dilemma is to call sys.setdefaultencoding() and set the default to the encoding you actually use. But that only hides the problem: it may appear to solve some text-handling issues at first, yet it is impractical, because many applications, especially web applications, use different text encodings in different places.
The correct solution is to change your code to handle text properly. Here are the guiding principles:
- All text strings should be of type unicode, not str. If you are handling text and the variable's type is str, that is a bug!
- To decode a byte string into a text string, use the correct encoding: var.decode(encoding) (for example, var.decode('utf-8')). To encode a text string into bytes, use var.encode(encoding).
- Never call str() on a Unicode string, and never call unicode() on a byte string without specifying an encoding.
- When your application reads data from outside, treat it as a byte string, i.e. of type str, and then call .decode() on it to interpret it as text. Likewise, always call .encode() on text before sending it out.
- If a string literal in your code represents text, it should always carry the u prefix. In practice, though, you probably should not be defining raw text literals in your code at all. Admittedly, I dislike this rule myself, and I suspect others do too.
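The decode-on-input, encode-on-output pattern described above can be sketched as follows (in Python 3 syntax, where str plays the role of Python 2's unicode and bytes plays the role of Python 2's str):

```python
# Round trip: bytes in, text processing in the middle, bytes out.
raw = b'caf\xc3\xa9'              # bytes arriving from the outside world

text = raw.decode('utf-8')        # decode at the boundary: now real text
assert text == 'café'
assert len(text) == 4             # four characters, not five bytes

text = text.upper()               # all processing happens on text

out = text.encode('utf-8')        # encode at the boundary before sending
assert out == b'CAF\xc3\x89'
```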
Incidentally, Python 3 fixes these problems by handling Unicode and byte strings correctly, which moves Python into the fourth class. More information is available in the official release notes on Unicode.
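In Python 3 the implicit conversions are simply gone: mixing the two types is an error rather than a silent ASCII-based guess, and crossing the boundary always requires an explicit encoding. A quick sketch:

```python
# Python 3 refuses to mix text and bytes implicitly:
try:
    'abc' + b'def'
except TypeError:
    print('explicit encode/decode required')

# The only way across the boundary is an explicit encoding:
assert 'abc'.encode('ascii') + b'def' == b'abcdef'
```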
I hope this helps. If you were unsure what Unicode is and how to handle it, things should be clearer now. The next time you encounter a UnicodeEncodeError or UnicodeDecodeError, you should know exactly where the problem lies and how to fix it!