Python character encoding
This article summarizes the Python character encoding problems I ran into in practice and proposes a set of conventions for avoiding encoding errors.
While writing the soriobadian I needed to summarize the problems I had solved on soj, and I planned to write up a solution for each problem along the way. The solutions are stored in a form Python can read directly, that is, a format that can be passed to eval for easy processing. Copying problem ids and titles from soj by hand is inconvenient, so I wanted to generate an empty solution file automatically, listing the problems I have solved. However, soj only gives you the list of ids of your solved problems and no other information. The missing information can be abstracted as a soj database containing one table, keyed by id, that holds all the information about each problem. The code is divided into two parts: the soj tool, which contains the database operations and obtains the list of AC (accepted) problems based on the id, and the soj tool, which is responsible for operations on the problem data.
I came to Python relatively late, at a time when I had to choose between Python 2 and Python 3. After reading about the differences, I felt that the design of Python 3 was more reasonable, while Python 2 was rather casual. In particular, the string concepts in Python 2 are unclear: mixing byte sequences with strings leads to confusion. I first wrote the tool in Python 3, and later wanted to use the code as a web backend to display the results as a web page. That involved Python 2, because Python 2 was the main version in the production environment. While porting to Python 2 I had to revisit strings, and this article is the summary.
Python 2 has the string (str) and unicode string (unicode) types, while Python 3 has the bytes and string (str) types. They correspond as follows: python2.string = python3.bytes, and python2.unicode string = python3.string. The string-related confusion in Python 2 comes from a misleading name: the name "string" is used for something that is conceptually bytes.
Bytes concept:
Bytes expresses a sequence of bytes. On its own the data carries very little meaning. The data only makes sense, and the encoding problem only arises, once the bytes are organized according to certain rules so that they express some concept.
For example, we can decide that every 4 bytes represent a 32-bit integer, where each group of 8 consecutive bits of the integer forms one byte and the low-order byte comes first. Bytes encoded this way express the concept of a 32-bit integer.
We can also decide that the bytes represent a UTF-8 encoded string, where every 1 to 4 bytes correspond to one Unicode character (the original UTF-8 design allowed up to 6 bytes). We then say the bytes are UTF-8 encoded and express the concept of a string.
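The two interpretations above can be sketched in Python 3 as follows. This is only an illustration: the sample values are made up, and the struct format "<I" is my choice for "32-bit unsigned integer, low-order byte first".

```python
import struct

# Four raw bytes; on their own they carry no meaning.
raw = b"\x01\x00\x00\x00"

# Interpretation 1: a 32-bit unsigned integer, low-order byte first.
(as_int,) = struct.unpack("<I", raw)

# Interpretation 2: UTF-8 text, which here yields four control characters.
as_text = raw.decode("utf-8")

# The other direction: an integer and a one-character string turned into bytes.
int_bytes = struct.pack("<I", 1)
text_bytes = "\u00e9".encode("utf-8")  # one character, two bytes in UTF-8

print(as_int, repr(as_text), int_bytes, text_bytes)
```

The same four bytes give an integer under one rule and a four-character string under the other, which is why bytes without a stated rule cannot be interpreted.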
The concept of string:
The string concept is expressed more accurately as [Char], a list of characters. How much storage one Char occupies is not something we need to know: it could be UTF-16, with a Char in 2 or 4 bytes, or UTF-32, with a Char in 4 bytes, or UTF-8, with a Char in 1 to 4 bytes. That is an implementation detail and should not affect how we use strings.
Therefore, when bytes are used to represent a string, the encoding must be specified. Converting a string to bytes is done with encode. When we know that some bytes represent a string in a given character encoding, we call decode with that encoding to get the string back. Strictly speaking, in Python 2 we should not call encode on a string (str) object, nor decode on a unicode string object.
More generally, any abstraction can be turned into its corresponding bytes through an encode step and recovered from those bytes through a decode step.
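A minimal Python 3 sketch of the encode/decode pairing described above. The sample text and the two encodings are arbitrary choices for illustration.

```python
text = "\u4e2d\u6587"  # two Chinese characters, written as escapes

utf8_bytes = text.encode("utf-8")        # 3 bytes per character for this text
utf16_bytes = text.encode("utf-16-le")   # 2 bytes per character for this text

# Decoding with the matching encoding recovers the original string.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_utf16 = utf16_bytes.decode("utf-16-le")

# Decoding with the wrong encoding either fails or produces garbage.
try:
    utf16_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
```

The string is the same in both cases; only the byte representation depends on the encoding named in the call.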
Based on the above, some encoding conventions are introduced to avoid coding errors.
1. Source code encoding. It is declared with a # coding: xxx comment, and the concept is the same and clear in both Python 2 and Python 3.
Convention:
Source code is always encoded in UTF-8.
2. string literal type and encoding:
In Python 2, a string literal of the form "xxxx" has the string (str) type, and its encoding is the same as the source code encoding.
A string literal of the form u"xxxx" has the unicode string type; the literal is automatically decoded from the source code encoding to unicode. If the conversion fails, an error is reported. In that case, check whether the encoding declared in the coding comment matches the real encoding of the source file.
In Python 3, a string literal of the form "xxxx" has the string (str) type. Here too, the literal is automatically decoded from the source code encoding to Unicode.
A string literal of the form b"xxxx" has the bytes type and stands for a piece of the source code itself, so when its content is text, we also say it is a string encoded consistently with the source code. (Python 3 only accepts ASCII characters and escape sequences inside a bytes literal.)
Usage:
In Python 2, use string (str) to express the concept of a string, with UTF-8 as the encoding. When unicode string and string are mixed in Python 2, the string is implicitly decoded to a unicode string (using the interpreter's default encoding, which is ASCII unless changed, so non-ASCII data fails). This is a trap: the types of the variables drift out of control, and you no longer know whether a value is a string or a unicode string, so pay special attention here. Using string with some other encoding to express the concept of a string is also possible, provided the character set of that encoding fits the application well.
In Python 3, use string (str) to express the concept of a string and do not worry about its encoding. Python 3 has no mixing problem: if bytes and string are used together, an error is raised.
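The Python 3 behaviour can be checked with a short sketch. The variable names are only for illustration.

```python
text = "abc"    # str: a sequence of characters
data = b"abc"   # bytes: a sequence of byte values

try:
    text + data              # Python 3 refuses to guess an encoding
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion has to be spelled out in one direction or the other.
joined_text = text + data.decode("utf-8")
joined_bytes = text.encode("utf-8") + data
```

Because the conversion is always explicit, the type of each variable stays under control, which is exactly what the Python 2 mixing rule takes away.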
3. Processing the result of urlopen(xxx).read():
What is returned here is clearly python3.bytes in concept.
If the returned content is the text of a web page, call decode(encoding='web page encoding', errors='ignore') to obtain the corresponding string.
Note, however, that under our convention strings in Python 2 have the type python2.string, so in Python 2 an additional encode step is needed: encode(encoding='source code encoding', errors='ignore')
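A minimal Python 3 sketch of the decode step. To stay runnable offline it uses a hand-built bytes value in place of urlopen(xxx).read(); decode_page and page_encoding are hypothetical names introduced here, not part of any library.

```python
def decode_page(raw, page_encoding="utf-8"):
    """Turn the bytes returned by urlopen(...).read() into a str.

    decode_page and page_encoding are hypothetical names for illustration.
    errors="ignore" silently drops bytes that cannot be decoded.
    """
    return raw.decode(encoding=page_encoding, errors="ignore")


# Stand-in for urlopen(url).read(): UTF-8 text followed by one invalid byte.
raw = "caf\u00e9".encode("utf-8") + b"\xff"
page = decode_page(raw, "utf-8")
```

With errors="ignore" the invalid byte disappears without any warning, so this setting trades early error detection for robustness.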
4. Write data files:
The data file written here is special: it can be treated as a piece of Python code and evaluated. Therefore:
Usage:
The file is encoded in UTF-8.
In Python 2, both
    with open(file, 'wb') as tempf:
        tempf.write(data)
and
    with open(file, 'w') as tempf:
        tempf.write(data)
work, because the string type does not distinguish between bytes and strings.
In Python 3, use:
    with open(file, 'wb') as tempf:
        tempf.write(data.encode(encoding='utf8', errors='ignore'))
or
    with open(file, 'w') as tempf:
        tempf.write(data)
because the former writes bytes and the latter writes a string.
All of the above works only because we agreed that both the data file and the source code are encoded in UTF-8. Without these conventions, the semantics of the snippets are as follows.
The semantics of the two Python 2 snippets are messy:
If we know that data is a string, the first snippet is theoretically incorrect, but in practice both snippets write the data to the file, and the data file has the same encoding as data.
If we know that data is a binary stream, the second snippet is theoretically incorrect, but both snippets still complete the task, and the data file has no encoding issue at all.
The semantics of the first Python 3 snippet: write the string data, with the data file encoded in UTF-8.
The semantics of the second Python 3 snippet: write the string data, with the data file encoded in the default text encoding (normally taken from the platform locale unless encoding= is passed; under our conventions it coincides with the source file encoding).
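A runnable Python 3 sketch of the two write styles. Unlike the text-mode snippet above, it passes encoding= explicitly so the result does not depend on the platform default; the temporary paths and the sample data are made up for illustration.

```python
import os
import tempfile

data = "name = '\u4e2d\u6587'\n"
tmpdir = tempfile.mkdtemp()
path_b = os.path.join(tmpdir, "binary.txt")
path_t = os.path.join(tmpdir, "text.txt")

# Binary mode: we encode ourselves, so the file is UTF-8 by construction.
with open(path_b, "wb") as tempf:
    tempf.write(data.encode(encoding="utf8", errors="ignore"))

# Text mode with an explicit encoding: open() encodes for us.
# newline="" disables newline translation so both files match byte for byte.
with open(path_t, "w", encoding="utf8", newline="") as tempf:
    tempf.write(data)

with open(path_b, "rb") as f_b, open(path_t, "rb") as f_t:
    bytes_b, bytes_t = f_b.read(), f_t.read()
```

Naming the encoding in the open() call turns the "data files are UTF-8" convention into something the code enforces, without relying on the environment.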
5. Read data files:
Usage:
In Python 2, both
    with open(file, 'rb') as tempf:
        tempf.read()
and
    with open(file, 'r') as tempf:
        tempf.read()
work, because the former returns bytes but uses string as the container, and the latter returns a string.
In Python 3, use:
    with open(file, 'rb') as tempf:
        tempf.read().decode(encoding='utf8', errors='ignore')
or
    with open(file, 'r') as tempf:
        tempf.read()
Because the open modes differ, read() in the former returns bytes, which we decode ourselves, and the latter returns a string (internally the bytes are decoded with the default encoding, which under our conventions matches the source file encoding).
Again, these snippets work only because of our conventions. Without those conventions, their semantics are as follows:
In Python 2, the semantics of the first snippet is to read back some binary data.
In Python 2, the semantics of the second snippet is to read back a string. (In theory this involves reading the binary data, decoding it with the current default encoding, and then encoding it again with the same default encoding. Here we treat the two transformations as cancelling out, so nothing actually happens. Obviously, when the data contains invalid characters, the two transformations do not cancel out.)
In Python 3, the semantics of the first snippet is to read binary data and decode it as UTF-8.
In Python 3, the semantics of the second snippet is to implicitly read binary data and decode it with the current default encoding.
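A runnable Python 3 sketch of the two read styles, with the encoding spelled out in the text-mode call so that it does not depend on the default. The temporary file and its content are made up for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")
payload = "['\u4e2d\u6587', 1]"
with open(path, "wb") as tempf:
    tempf.write(payload.encode("utf8"))

# Binary mode: read bytes, then decode explicitly.
with open(path, "rb") as tempf:
    text_from_bytes = tempf.read().decode(encoding="utf8", errors="ignore")

# Text mode: open() decodes for us using the encoding we name.
with open(path, "r", encoding="utf8") as tempf:
    text_from_text = tempf.read()
```

Both variables hold the same string; the only difference is whether the decode step is written out or done inside open().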
In my application, the data read back is then passed through eval. In Python 2 an encoding error may therefore not surface until eval, whereas in Python 3 it can surface earlier, at decode time.
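The "earlier error" in Python 3 can be sketched as follows. Two things here are my assumptions and not the original tool: decoding is strict (errors='ignore' would suppress the error), and ast.literal_eval stands in for eval because it only accepts literals.

```python
import ast

# Bytes that are valid Latin-1 but not valid UTF-8.
bad = b"['caf\xe9']"

try:
    bad.decode("utf8")   # strict by default: fails here, before any eval
    failed_early = False
except UnicodeDecodeError:
    failed_early = True

# Well-formed UTF-8 data decodes cleanly and can then be evaluated.
good = "['caf\u00e9', 1]".encode("utf8")
value = ast.literal_eval(good.decode("utf8"))
```

The failure happens at the decode call, where the offending file is still known, not later inside the evaluated data.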