Unicode
What is Unicode
The Unicode Standard provides a unique number for every character, no matter the platform, device, application, or programming language. --From http://unicode.org/standard/WhatIsUnicode.html
Encodings before Unicode
Examples include ASCII, GBK, and so on.
These early character encodings are limited: no single one of them can cover all the languages of the world.
Early character encodings also conflict with one another: two encodings may use the same number for different characters, or different numbers for the same character. Any given computer (especially a server) needs to support many different encodings, and whenever data passes between computers or between encodings, it is at risk of being corrupted. --From https://zh.wikipedia.org/wiki/Unicode#%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F
UTF
UTF (Unicode Transformation Format) refers to the Unicode transformation formats.
For example, if a Unicode file contains only basic 7-bit ASCII characters, transmitting each character in the original 2-byte Unicode encoding means the first byte of each character is always 0, which is quite wasteful. In this case you can use UTF-8, a variable-length encoding that still represents a basic 7-bit ASCII character in a single byte (with a leading 0 bit). When other Unicode characters are mixed in, an algorithm encodes each character in several bytes, with the leading bits identifying how many. This significantly shortens the encoding of Western documents dominated by 7-bit ASCII characters (see UTF-8 for details). Similarly, the 2-byte UTF-16 encoding applies a transformation algorithm of its own when 4-byte supplementary-plane characters and other UCS-4 characters are needed. --From https://zh.wikipedia.org/wiki/Unicode#%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F
Python encoding
A Unicode string is a sequence of code points (numbers).
In Python, encoding means converting Unicode to bytes. --From https://docs.python.org/3/howto/unicode.html#encodings
For the ASCII encoding:
- If the code point is less than 128, the byte value is the same as the code point value.
- If the code point is 128 or greater, the character cannot be represented in this encoding (Python raises UnicodeEncodeError).
--From https://zh.wikipedia.org/wiki/Unicode#%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F
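The two ASCII rules above can be sketched in a few lines; the sample strings here are illustrative only:

```python
# Code points below 128 encode to bytes with the same values.
encoded = "hello".encode("ascii")
print(list(encoded))  # [104, 101, 108, 108, 111]

# A code point of 128 or more cannot be represented in ASCII.
try:
    "héllo".encode("ascii")  # 'é' is U+00E9 (code point 233)
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```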
UTF-8 is the most commonly used encoding and has several convenient properties:
- It can handle any Unicode code point.
- ASCII text is also valid UTF-8 text.
- UTF-8 is fairly compact; the most commonly used characters take only one or two bytes.
--From https://zh.wikipedia.org/wiki/Unicode#%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F
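A quick sketch demonstrating these properties (the sample characters are arbitrary):

```python
# ASCII text is valid UTF-8: both encodings produce the same bytes.
assert "abc".encode("utf-8") == "abc".encode("ascii")

# Any code point can be encoded; byte length grows with the code point.
for ch in ("a", "é", "中", "😀"):
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3 and 4 bytes respectively
```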
Python 3 support for Unicode
Since Python 3.0, strings are stored as Unicode.
The default encoding for Python source code is UTF-8; a different encoding can be declared with a comment of the form # -*- coding: <encoding name> -*-.
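A minimal sketch of what "strings are Unicode" means in practice: len() counts code points, not bytes, and encoding to bytes is an explicit step.

```python
s = "Unicode: 中文"
print(len(s))       # 11 -- counts characters (code points), not bytes

b = s.encode("utf-8")           # converting to bytes is explicit
print(type(b).__name__, len(b))  # bytes 15 -- the two CJK characters take 3 bytes each
```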
Reading and writing Unicode data
Unicode data is usually converted to a particular encoding before it is written to disk or sent over a socket. You can do all the work yourself: open a file, read 8-bit bytes from it, and convert them with bytes.decode(encoding). However, this manual approach is not recommended.
One reason is that a Unicode character can be represented by more than one byte. If you read blocks of an arbitrary size (say, 1024 or 4096 bytes), you need error-handling code to catch the case where a multi-byte character is split across the end of a block. One solution is to read the entire file into memory, but then you cannot work with very large files.
The solution is to use the low-level decoding interface to catch partial coding sequences. This work is already done for you by the open() function: open(filename, encoding=encoding) returns a file-like object with methods such as read() and write().
--The above is quoted from https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data
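A minimal sketch of letting open() handle the encoding; the filename is hypothetical:

```python
# Writing: the str is encoded to UTF-8 bytes on the way out.
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("héllo wörld\n")

# Reading: bytes are decoded back to str; no manual bytes.decode() needed,
# and multi-byte characters split across internal read blocks are handled for us.
with open("example.txt", "r", encoding="utf-8") as f:
    data = f.read()

print(data)
```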
Tips for writing Unicode-aware programs
Software should work only with Unicode strings internally: decode input data (bytes) as early as possible, and encode output only at the very end.
When using data from a browser or another untrusted source, a common technique is to check a string for illegal characters before using it on a command line or storing it in a database. If you do this, be careful to check the decoded string, not the encoded bytes, because some encodings have interesting properties, such as not being bijective or not being fully ASCII-compatible. --From https://docs.python.org/3/howto/unicode.html#tips-for-writing-unicode-aware-programs
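The "decode early, encode late" pattern can be sketched as follows; the validation rule and the helper name is_safe are illustrative only, not a real sanitization recipe:

```python
def is_safe(decoded: str) -> bool:
    # Check the *decoded string*, never the raw bytes.
    return not any(ch in decoded for ch in ";|&")

raw = "user input".encode("utf-8")  # bytes arriving from the outside world
text = raw.decode("utf-8")          # decode as soon as possible
print(is_safe(text))                # validate the str form

output = text.encode("utf-8")       # encode only at the output boundary
```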
Files in an unknown encoding
If you know that a file's encoding is reasonably ASCII-compatible and you only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler.
with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

# make changes to the string 'data'

with open(fname + '.new', 'w', encoding="ascii", errors="surrogateescape") as f:
    f.write(data)
The surrogateescape error handler decodes any non-ASCII bytes to special surrogate code points. These surrogate code points turn back into exactly the same bytes when the data is encoded with surrogateescape again and written out.
--From Https://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding
Alternatively, assuming the file uses a single encoding throughout, you can try decoding it with each of the standard encodings in turn and pick a suitable result from those that decode without error, i.e. one with no garbled characters.
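The trial-decoding idea above can be sketched as below; the candidate list, helper name guess_encoding, and sample data are all illustrative. Note that some encodings (e.g. latin-1) accept any byte sequence, so a successful decode may still be garbled and needs a human check:

```python
def guess_encoding(data: bytes, candidates=("utf-8", "gbk", "latin-1")):
    """Return the first candidate encoding that decodes without error."""
    for enc in candidates:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None

enc, text = guess_encoding("中文".encode("utf-8"))
print(enc)   # utf-8
print(text)  # 中文
```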