Idiomatic conventions for Python character encoding

Source: Internet
Author: User

This article summarizes the character-encoding problems that come up when using Python in practice, and establishes a set of encoding conventions to avoid encoding errors.

While writing up my notes I needed to summarize the problems on SOJ, and planned to write a solution report for each problem along the way. The reports use a Python-readable format that Python can eval directly, which makes them easy to process. Copying each problem's id and title from SOJ by hand every time is inconvenient, so I decided to generate empty reports automatically for the problems I have already solved. However, SOJ itself only provides the list of IDs of my solved problems; the other information is missing. That missing information can be abstracted as an SOJ database: a table keyed by ID that holds all the information about each problem. So the code is split into two parts. One part is the SOJ tool, which contains the database operations and fetches the list of AC (accepted) problems by ID; the other part is built on the SOJ tool and is responsible for processing the data.

I came to Python late. At that time I had to choose between Python 2 and Python 3; after looking at some of the differences, I felt that Python 3 was designed more sensibly and Python 2 more casually. This is especially true of strings: Python 2's concepts are unclear, with byte sequences and strings mixed together, which causes confusion. I first wrote the tool in Python 3, then wanted to use the code as a web backend and show the results as web pages. That involves Python 2, because the production environment is still Python 2-based. Porting to Python 2 forced me to revisit strings, and this article is the resulting summary.

Python 2 has the string (str) and Unicode string (unicode) types, while Python 3 has the bytes (bytes) and string (str) types. They correspond as follows: python2.string = python3.bytes, python2.unicode string = python3.string. In Python 2, the confusion around strings comes down to a naming mistake: the name "string" is used for what is really the concept of bytes.


The concept of bytes:
Bytes represent a sequence of bytes. The data by itself carries very little meaning; only when it is organized according to certain rules to express some concept (that is, when it has an encoding) does the data become meaningful.

For example, we can take bytes to represent one 32-bit integer per 4 bytes, where each consecutive 8 bits of the integer form one byte and the low-order byte comes first (little-endian). The bytes then have an encoding and express the concept of 32-bit integers.
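As a minimal sketch of that rule (the sample values are made up for illustration), the standard struct module can pack integers into little-endian 4-byte groups and recover them again:

```python
import struct

# Three 32-bit integers packed little-endian ("<" = little-endian, "i" = signed 32-bit).
raw = struct.pack("<3i", 1, 256, -1)

# The raw bytes, low-order byte first in each group of four.
expected_raw = b"\x01\x00\x00\x00\x00\x01\x00\x00\xff\xff\xff\xff"

# "Decoding" the bytes under the agreed rule recovers the integers.
values = list(struct.unpack("<3i", raw))

# int.from_bytes applies the same rule to a single 4-byte group.
first = int.from_bytes(raw[0:4], byteorder="little", signed=True)
```

Without the agreed rule, the same 12 bytes could just as well be read as text or as three big-endian integers; the encoding is what gives them meaning.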

We can also take bytes to represent a UTF-8 encoded string, in which every 1 to 4 bytes correspond to one Unicode character (the original UTF-8 design allowed up to 6 bytes, but the current standard limits it to 4). We then say the bytes are UTF-8 encoded and express the concept of a string.
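A small Python 3 sketch of the variable width, using four sample characters chosen only for illustration (written as escapes so the example does not depend on the file's own encoding):

```python
# "a", "é", "中" and an emoji: one character from each UTF-8 length class.
text = "a\u00e9\u4e2d\U0001F600"

# Number of UTF-8 bytes each character occupies.
sizes = [len(ch.encode("utf-8")) for ch in text]

# The whole string as bytes, and back again.
raw = text.encode("utf-8")
round_trip = raw.decode("utf-8")
```

Here the string has 4 characters but its UTF-8 form has 10 bytes, which is why the length of a string and the length of its bytes must not be confused.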


The concept of string:
If you want to express the concept of a string, [Char] (a list of characters) is more accurate. How much storage a char occupies is completely unspecified. It could be UTF-16, with 2 or 4 bytes per char, or UTF-32, with 4 bytes per char, or UTF-8, with 1 to 4 bytes per char. That is an implementation matter and should have no effect on how we use strings.

So, when we need bytes to represent a string, we must specify an encoding and convert the string to bytes; the corresponding function is encode. When we believe that some bytes are a string in a given character encoding, we call decode with that encoding to get a string. Strictly speaking, in Python 2 we should never call the encode method on a string (str) object, and never call the decode method on a Unicode string object.
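The pairing of encode and decode can be sketched in Python 3 as follows; the sample character and the choice of encodings are only for illustration:

```python
s = "\u4e2d"  # the single character "中"

# The same string gives different bytes under different encodings.
as_utf8 = s.encode("utf-8")
as_utf16 = s.encode("utf-16-le")

# Decoding with the encoding that was used to encode recovers the string.
back = as_utf8.decode("utf-8")

# Decoding with a different encoding does not fail here, but gives wrong text.
wrong = as_utf8.decode("latin-1")
```

The bytes carry no record of their encoding, so the caller has to supply the right one on decode.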

More generally, any abstraction can be encoded into corresponding bytes, and recovered from those bytes through decode.
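As one illustration of this idea, a record can be encoded to bytes in two steps (structure to text, text to bytes) and decoded in the reverse order. The record and its field names are hypothetical, and JSON is just one possible rule, not the format used by the article's tool:

```python
import json

# A hypothetical problem record.
record = {"id": 1001, "title": "A+B"}

# encode: abstraction -> text -> bytes
raw = json.dumps(record).encode("utf-8")

# decode: bytes -> text -> abstraction
restored = json.loads(raw.decode("utf-8"))
```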

On this basis, some encoding conventions are introduced below, with the goal of avoiding encoding errors.


1. Source code encoding: declared with a #coding:xxx comment. The concept is clear in both Python 2 and Python 3.
Convention:
The source code uses only UTF-8 encoding.
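To see how such a declaration is read, the standard tokenize module can report the encoding declared in a piece of source code. The two-line source below is a made-up example held in memory rather than a real file:

```python
import io
import tokenize

# A tiny source file as raw bytes: a coding declaration, then a UTF-8 string literal.
source_bytes = b"# coding: utf-8\nname = '\xe4\xb8\xad'\n"

# detect_encoding reads the first lines and returns the declared encoding.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# With the encoding known, the source bytes can be decoded into text.
source_text = source_bytes.decode(encoding)
```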


2. Types and encodings of string literals:
In Python 2, a string literal of the form "xxxx" has the string type, with the same encoding as the source code.
A literal of the form u"xxxx" has the Unicode string type; the source-encoded bytes are automatically decoded into a Unicode string. If the conversion fails, an error is raised, and you need to check whether the encoding declared in the comment matches the real encoding of the source file.

In Python 3, a string literal of the form "xxxx" has the string type; here too, the source-encoded bytes are automatically decoded to Unicode.
A literal of the form b"xxxx" has the bytes type and represents a piece of the source code as-is. So when its content is text, we can also regard it as a string in the same encoding as the source code. (Python 3 only accepts ASCII characters directly in a bytes literal; any other byte has to be written as an escape such as \xe4.)
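A short Python 3 sketch of the two literal forms, using the text "中文" as an arbitrary example written with escapes:

```python
s = "\u4e2d\u6587"                  # str literal: two characters
b = b"\xe4\xb8\xad\xe6\x96\x87"     # bytes literal: the same text as six UTF-8 bytes

kinds = (type(s).__name__, type(b).__name__)

# Indexing shows the difference in what each type contains.
first_char = s[0]   # a one-character str
first_byte = b[0]   # an int between 0 and 255
```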

Convention:
Python 2 uses string (str) to express the concept of a string, encoded as UTF-8. When Unicode strings and strings are mixed in Python 2, the string is implicitly decoded into a Unicode string using the interpreter's default encoding (ASCII unless it has been changed), which fails for non-ASCII UTF-8 data. This is a trap: the type of a variable gets out of control, and you no longer know whether it is a string or a Unicode string, so pay special attention here. Using string with some other encoding is also possible; it depends on how well that encoding's character set fits the application.

Python 3 uses string (str) to express the concept of a string, and you do not need to care about its encoding. Python 3 has no mixing problem: using bytes and string together raises an error.
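The Python 3 behaviour can be checked with a few lines; the values are invented for the example:

```python
text = "id: "
raw = b"1001"

try:
    combined = text + raw          # str + bytes is rejected
except TypeError:
    combined = None

# The explicit fix: decode the bytes first, naming the encoding.
fixed = text + raw.decode("utf-8")

# Comparison does not raise, but a str never equals a bytes object.
same = ("1001" == raw)
```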


3. Processing the result of urlopen(xxx).read():
The value returned here is clearly, in concept, python3.bytes.
If we are sure that the returned object is a text web page, we can call decode(encoding='page encoding', errors='ignore') to get the corresponding string.

Note, however, the Python 2 convention from section 2: our string type there is python2.string, so in Python 2 there is an additional encode step: encode(encoding='source encoding', errors='ignore')
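A minimal Python 3 sketch of the decode step. The helper names decode_page and fetch_text are hypothetical, and the page encoding is assumed to be known by the caller; fetch_text is defined but not called here, so the example needs no network access:

```python
from urllib.request import urlopen

def decode_page(raw, page_encoding="utf-8"):
    """Turn the bytes of a text page into a str, dropping undecodable bytes."""
    return raw.decode(encoding=page_encoding, errors="ignore")

def fetch_text(url, page_encoding="utf-8"):
    """Download a page and decode it (hypothetical helper, not called below)."""
    with urlopen(url) as response:
        return decode_page(response.read(), page_encoding)

# Sample bytes standing in for a downloaded page; the trailing \xff is invalid UTF-8.
sample = b"\xe4\xb8\xad\xe6\x96\x87\xff"
page_text = decode_page(sample)
```

Note that errors='ignore' silently drops the invalid byte, which is convenient for scraping but hides corrupted input.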


4. Writing the data file:
The file written here is special: it can be treated as Python code and executed, so:

Convention:
The file encoding is UTF-8.

In Python 2, both
    with open(file, 'wb') as tempf:
        tempf.write(data)
and
    with open(file, 'w') as tempf:
        tempf.write(data)
work, because string cannot distinguish between bytes and a string.

In Python 3, use:
    with open(file, 'wb') as tempf:
        tempf.write(data.encode(encoding='utf8', errors='ignore'))
or
    with open(file, 'w') as tempf:
        tempf.write(data)
because the former writes bytes and the latter writes a string.
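The two Python 3 variants can be compared in a self-contained sketch. The record text and file names are made up, and the text-mode call passes encoding='utf-8' explicitly, which is an addition to the snippet above so that the result does not depend on the machine's default encoding:

```python
import os
import tempfile

# Hypothetical data: a Python-readable record as text.
data = "{'id': 1001, 'title': '\u4e2d\u6587'}"

with tempfile.TemporaryDirectory() as tmp:
    path_binary = os.path.join(tmp, "binary.txt")
    path_text = os.path.join(tmp, "text.txt")

    # Variant 1: binary mode, we encode the string ourselves.
    with open(path_binary, "wb") as tempf:
        tempf.write(data.encode(encoding="utf8", errors="ignore"))

    # Variant 2: text mode, open() encodes for us.
    with open(path_text, "w", encoding="utf-8") as tempf:
        tempf.write(data)

    # Read both files back as raw bytes to compare what landed on disk.
    with open(path_binary, "rb") as f:
        bytes_binary = f.read()
    with open(path_text, "rb") as f:
        bytes_text = f.read()
```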

This works because we agreed that both the data file and the source code are UTF-8. Without these conventions, let us look at what the code actually means:

The semantics of the two Python 2 snippets are somewhat confusing:
If data is known to be a string, the first snippet is theoretically wrong, but in practice both snippets write the data to the file, and the file's encoding is the same as the data's.
If data is known to be a binary stream, the second snippet is theoretically wrong, but in practice both snippets work, and the data file has no encoding issue at all.

In Python 3, the first snippet means: write the string data to the file, encoded as UTF-8.
The second snippet means: write the string data to the file, encoded with the default encoding. (Here the default encoding is assumed to equal the current source file encoding. In practice, open() without an encoding argument uses a platform-dependent default taken from the locale, so passing encoding='utf8' explicitly is safer.)
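Why the default matters can be sketched as follows. The snippet inspects the locale's preferred encoding and shows that the same character produces different bytes under two encodings; GBK is picked only as an example of a non-UTF-8 default that a machine might have:

```python
import locale

# What text-mode open() commonly falls back to when no encoding is given.
default_encoding = locale.getpreferredencoding(False)

# The same character under two possible defaults.
ch = "\u4e2d"
as_utf8 = ch.encode("utf-8")
as_gbk = ch.encode("gbk")

# A file written under one default and read under the other would be misread.
differs = as_utf8 != as_gbk
```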


5. Reading the data file:
Convention:
In Python 2, both
    with open(file, 'rb') as tempf:
        data = tempf.read()
and
    with open(file, 'r') as tempf:
        data = tempf.read()
work: the former returns bytes, but uses a string as the container; the latter returns a string.

In Python 3, use:
    with open(file, 'rb') as tempf:
        data = tempf.read().decode(encoding='utf8', errors='ignore')
or
    with open(file, 'r') as tempf:
        data = tempf.read()
Because the open modes differ, the former returns bytes, while the latter returns a string (internally it decodes the bytes with the default encoding, which we assume here to equal the source file encoding).
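A self-contained Python 3 sketch of the two ways of reading. The file content is invented, and the text-mode call names the encoding explicitly instead of relying on the default:

```python
import os
import tempfile

expected = "{'id': 1001, 'title': '\u4e2d\u6587'}"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")

    # Prepare a UTF-8 data file to read back.
    with open(path, "wb") as f:
        f.write(expected.encode("utf-8"))

    # Variant 1: binary mode returns bytes, which we decode ourselves.
    with open(path, "rb") as tempf:
        from_binary = tempf.read().decode(encoding="utf8", errors="ignore")

    # Variant 2: text mode decodes inside read() and returns a str.
    with open(path, "r", encoding="utf-8") as tempf:
        from_text = tempf.read()
```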

As with writing, this code relies on a number of conventions. Without them, its semantics are as follows:

In Python 2, the first snippet means: read back some binary data.
In Python 2, the second snippet means: read in a string. (Theoretically there is a step that decodes the binary data with the current default encoding and then encodes it again with the same encoding, but the two transformations cancel out, so nothing actually happens. Obviously, if the data contained illegal characters, the two transformations would not cancel out.)

In Python 3, the first snippet means: read the binary data and decode it as UTF-8.
In Python 3, the second snippet means: implicitly read the binary data and decode it with the current default encoding.

In my application, the data that is read back then goes through eval. Clearly, in Python 2 an encoding error is not detected until eval, whereas Python 3 can detect it earlier.
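The earlier detection in Python 3 can be sketched like this. The helper name load_record and the sample records are hypothetical, and ast.literal_eval is swapped in for eval because it only accepts literals, which is a safer choice for data files; the article itself uses eval:

```python
import ast

def load_record(raw):
    """Decode strictly, then evaluate the text as a Python literal."""
    text = raw.decode("utf-8")      # strict mode: invalid bytes raise here
    return ast.literal_eval(text)   # only reached when decoding succeeded

good = "{'id': 1001, 'title': '\u4e2d\u6587'}".encode("utf-8")
bad = b"{'id': 1002, 'title': '\xff'}"   # \xff is never valid in UTF-8

record = load_record(good)

try:
    load_record(bad)
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
```

With errors='ignore' the bad byte would be dropped silently instead, so strict decoding is the variant that surfaces the problem before evaluation.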
