How to define the file encoding of a Python source file

Source: Internet
Author: User

Brief introduction

This article describes how to declare the encoding of a Python source file. The Python parser uses this declaration to interpret the file with the given encoding. Most notably, this improves the interpretation of Unicode literals in the source code and makes it possible to write Unicode literals directly, for example as UTF-8 in a Unicode-aware editor.

Problem

In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding "unicode-escape". This makes the programming environment rather unfriendly to users who live and work in non-Latin-1 locales, such as many Asian countries. Programmers can write their 8-bit strings in their preferred encoding, but are bound to the "unicode-escape" encoding for Unicode literals.

Workaround

The proposal is to make the Python source code encoding both visible and changeable on a per-file basis by using a special comment at the top of the file to declare the encoding.

To make Python aware of this encoding declaration, a number of changes are needed in how Python source code data is handled.

Defining the Encoding

Python defaults to ASCII as the standard encoding if no other encoding hints are given.

To define a source code encoding, a special comment must be placed on the first or second line of the source file, for example:

          # coding=<encoding name>

or (using formats recognized by popular editors)

          #!/usr/bin/python
          # -*- coding: <encoding name> -*-

or

          #!/usr/bin/python
          # vim: set fileencoding=<encoding name> :

More precisely, the first or second line must match the regular expression "coding[:=]\s*([-\w.]+)". The first group of this expression is interpreted as the encoding name. If the encoding is unknown to Python, an error is raised at compile time. No Python statement may appear on the line that contains the encoding declaration.
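
The matching rule can be tried out with a short sketch. This is only an illustration written for Python 3, not the interpreter's actual implementation; the helper name find_declared_encoding is hypothetical, and limiting the search to comment lines is an assumption made here.

```python
import re

# Pattern quoted in the text; the first group captures the encoding name.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def find_declared_encoding(source_text):
    """Return the encoding named on line 1 or 2, or None if there is none."""
    for line in source_text.splitlines()[:2]:
        # Assumption of this sketch: only comment lines are considered.
        if not line.lstrip().startswith("#"):
            continue
        match = CODING_RE.search(line)
        if match:
            return match.group(1)
    return None


emacs_style = find_declared_encoding(
    "#!/usr/bin/python\n# -*- coding: latin-1 -*-\n"
)
vim_style = find_declared_encoding("# vim: set fileencoding=utf-8 :\n")
too_late = find_declared_encoding(
    "#!/usr/bin/python\n#\n# -*- coding: latin-1 -*-\n"
)
print(emacs_style, vim_style, too_late)
```

Because the pattern only looks for "coding" followed by ":" or "=", the Vim spelling "fileencoding=" matches as well.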

To help with platforms such as Windows, which add a Unicode BOM to the beginning of Unicode files, the UTF-8 signature "\xef\xbb\xbf" is also interpreted as a "utf-8" encoding declaration, even if no encoding comment is given.

If a source file has both a UTF-8 BOM and an encoding comment, the only encoding allowed in the comment is "utf-8". Any other encoding results in an error.
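
The BOM rules can be observed with the standard library's tokenize.detect_encoding() function. The sketch below assumes Python 3, where a detected BOM is reported as "utf-8-sig"; the helper name detect is hypothetical.

```python
import io
import tokenize

BOM = b"\xef\xbb\xbf"


def detect(source_bytes):
    """Run the standard library's encoding detection on raw source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A BOM alone is enough to select UTF-8.
bom_only = detect(BOM + b"import os\n")

# A BOM combined with a matching declaration is accepted.
bom_and_utf8 = detect(BOM + b"# -*- coding: utf-8 -*-\n")

# A BOM combined with any other declaration is rejected.
try:
    detect(BOM + b"# -*- coding: latin-1 -*-\n")
    conflict_rejected = False
except SyntaxError:
    conflict_rejected = True

print(bom_only, bom_and_utf8, conflict_rejected)
```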

Example

The following examples show the different ways of declaring the source file encoding at the top of a Python source file.

1. With an interpreter line and an Emacs-style file encoding comment:

          #!/usr/bin/python
          # -*- coding: latin-1 -*-
          import os, sys
          ...

          #!/usr/bin/python
          # -*- coding: iso-8859-15 -*-
          import os, sys
          ...

          #!/usr/bin/python
          # -*- coding: ascii -*-
          import os, sys
          ...

2. Without an interpreter line, using plain text:

          # This Python file uses the following encoding: utf-8
          import os, sys
          ...

3. Text editors may have different ways of declaring the file's encoding, for example:

          #!/usr/local/bin/python
          # coding: latin-1
          import os, sys
          ...

4. Without an encoding comment, the Python parser assumes ASCII text:

          #!/usr/local/bin/python
          import os, sys
          ...

5. Encoding comments that do not work:

Missing "coding:" prefix:

          #!/usr/local/bin/python
          # latin-1
          import os, sys
          ...

Encoding comment not on the first or second line:

          #!/usr/local/bin/python
          #
          # -*- coding: latin-1 -*-
          import os, sys
          ...

Unsupported encoding:

          #!/usr/local/bin/python
          # -*- coding: utf-42 -*-
          import os, sys
          ...
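
The three failing cases can be checked with tokenize.detect_encoding() from the standard library. This sketch assumes Python 3, where the fallback reported when no declaration is found is "utf-8" rather than the ASCII default described in this article; the helper name declared_encoding is hypothetical.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding reported by the standard library's detector."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Missing "coding:" prefix: nothing is detected, so the default is reported.
no_prefix = declared_encoding(
    b"#!/usr/local/bin/python\n# latin-1\nimport os, sys\n"
)

# Comment on the third line: too late, so the default is reported again.
third_line = declared_encoding(
    b"#!/usr/local/bin/python\n#\n# -*- coding: latin-1 -*-\nimport os, sys\n"
)

# Unknown encoding name: detection fails with an error.
try:
    declared_encoding(b"#!/usr/local/bin/python\n# -*- coding: utf-42 -*-\n")
    unsupported_rejected = False
except SyntaxError:
    unsupported_rejected = True

print(no_prefix, third_line, unsupported_rejected)
```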

Concepts

The encoding comment relies on the following concepts:

1. The complete Python source file should use a single encoding. Embedding differently encoded data is not allowed and results in a decoding error when the source code is compiled.

Any encoding that allows the first two lines to be processed as described above can be used as a source code encoding. This includes ASCII-compatible encodings as well as certain multi-byte encodings such as Shift_JIS. It does not include encodings that use two or more bytes for every character, such as UTF-16. The reason is to keep the encoding detection algorithm in the tokenizer simple.

2. Handling of escape sequences should continue to work as it does now, but with all possible source code encodings: standard string literals (both 8-bit and Unicode) are subject to escape sequence expansion, while raw string literals expand only a very small subset of escape sequences.

3. Python's tokenizer/compiler combination needs to be updated to work as follows:

a) Read the file

b) Decode it into Unicode, assuming a fixed per-file encoding

c) Convert it into a UTF-8 byte string

d) Tokenize the UTF-8 content

e) Compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first re-encoding the UTF-8 data into 8-bit string data using the given file encoding

Note that Python identifiers are restricted to the ASCII subset of the encoding, so no further conversion is needed after step d.
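
Steps a) to d) can be imitated in user code. The sketch below assumes Python 3 and uses the standard tokenize module as a stand-in for the interpreter's internal tokenizer; the function name tokenize_source is hypothetical, and the encoding is passed in explicitly instead of being read from a comment.

```python
import io
import tokenize


def tokenize_source(raw_bytes, declared_encoding):
    """Mimic steps b) to d): decode, re-encode as UTF-8, then tokenize."""
    # b) Decode the raw bytes using the single per-file encoding.
    text = raw_bytes.decode(declared_encoding)
    # c) Convert the text to a UTF-8 byte string.
    utf8_bytes = text.encode("utf-8")
    # d) Tokenize the UTF-8 content.
    return list(tokenize.tokenize(io.BytesIO(utf8_bytes).readline))


# a) Stand-in for reading a Latin-1 encoded file from disk.
raw = "name = 'caf\u00e9'\n".encode("latin-1")

tokens = tokenize_source(raw, "latin-1")
strings = [tok.string for tok in tokens if tok.type == tokenize.STRING]
names = [tok.string for tok in tokens if tok.type == tokenize.NAME]
print(names, strings)
```

The sample input deliberately carries no encoding comment: after step c) the bytes are UTF-8, so a leftover "coding: latin-1" comment would mislead the tokenize module in this sketch.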

Backward compatibility

For backward compatibility with existing code that uses non-ASCII characters in string literals without declaring an encoding, the implementation is introduced in two phases:

1. Allow non-ASCII characters in string literals and comments by internally treating a missing encoding declaration as a declaration of "iso-8859-1". This lets arbitrary byte strings round-trip correctly between steps b and e of the processing, and provides compatibility with Python 2.2 for Unicode literals that contain non-ASCII bytes.

A warning is issued if non-ASCII bytes are found in the input, once per improperly encoded input file.

2. Remove the warning and change the default encoding to "ascii".

The built-in compile() API is enhanced to accept Unicode as input. 8-bit string input goes through the standard encoding detection procedure described above.

If a Unicode string with a coding declaration is passed to compile(), a SyntaxError is raised.
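
The byte-string case can be tried directly. This sketch assumes Python 3, where compile() runs bytes input through the same declaration detection; the handling of text strings that carry a declaration has changed between versions and is not shown. The file name "<example>" is a placeholder.

```python
# Byte-string input: the declaration on the first line selects the codec,
# so the single byte 0xE9 is read as U+00E9 (Latin-1 "e" with acute accent).
source = b"# -*- coding: latin-1 -*-\nvalue = '\xe9'\n"

namespace = {}
exec(compile(source, "<example>", "exec"), namespace)

decoded_as_latin1 = namespace["value"] == "\u00e9"
print(decoded_as_latin1)
```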

SUZUKI Hisao is working on a patch; see [2] for details. A patch that implements only phase 1 is available at [1].

New developments

Phases 1 and 2 described under backward compatibility were implemented in the 2.3 release, except for changing the default encoding to "ascii".

The default encoding was set to "ascii" in version 2.5.

References

    [1] Phase 1 implementation:
        http://python.org/sf/526840

    [2] Phase 2 implementation:
        http://python.org/sf/534304

History

    1.10 and above: see CVS
    1.8: Added '.' to the coding RE.
    1.7: Added warnings to phase 1 implementation. Replaced the
         Latin-1 default encoding with the interpreter's default
         encoding. Added tweaks to compile().
    1.4 - 1.6: Minor tweaks
    1.3: Worked in comments by Martin v. Loewis:
         UTF-8 BOM mark detection, Emacs style magic comment,
         phase approach to the implementation

Copyright

    This document has been placed in the public domain.

Source:https://hg.python.org/peps/file/tip/pep-0263.txt
