Unicode
What is Unicode
The Unicode Standard provides a unique number for every character, no matter the platform, device, application, or programming language. --From http://unicode.org/standard/WhatIsUnicode.html
Encodings before Unicode
Examples include ASCII, GBK, and so on.
These early character encodings are limited: no single one of them can cover all the languages of the world.
Early character encodings also conflict with one another: two encodings may use the same number for different characters, or different numbers for the same character. Any given computer (especially a server) needs to support many different encodings, and whenever data passes between computers or between encodings, it is at risk of being corrupted. --From https://zh.wikipedia.org/wiki/Unicode#%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F
UTF
UTF (Unicode Transformation Format) refers to the Unicode transformation formats.
For example, if a Unicode file contains only basic 7-bit ASCII characters, transmitting each character in the original 2-byte Unicode encoding means the first byte of each character is always 0, which is quite wasteful. In this case you can use UTF-8, a variable-length encoding that still represents a basic 7-bit ASCII character in a single byte (with a leading 0 bit). When other Unicode characters are mixed in, an algorithm encodes each character in several bytes, with the leading bits identifying how many. This significantly shortens the encoding of Western documents dominated by 7-bit ASCII characters (see UTF-8 for details). Similarly, the 2-byte UTF-16 encoding applies a transformation algorithm of its own when 4-byte supplementary-plane characters and other UCS-4 characters are needed. --From https://zh.wikipedia.org/wiki/Unicode#%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F
Python encoding
A Unicode string is a sequence of code points (numbers).
In Python, encoding means converting Unicode to bytes. --From https://docs.python.org/3/howto/unicode.html#encodings
For the ASCII encoding:
- If the code point is less than 128, the byte value is the same as the code point value.
- If the code point is 128 or greater, the character cannot be represented in this encoding (Python raises UnicodeEncodeError).
--From https://zh.wikipedia.org/wiki/Unicode#%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F
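The two ASCII rules above can be sketched in a few lines; the sample strings here are illustrative only:

```python
# Code points below 128 encode to bytes with the same values.
encoded = "hello".encode("ascii")
print(list(encoded))  # [104, 101, 108, 108, 111]

# A code point of 128 or more cannot be represented in ASCII.
try:
    "héllo".encode("ascii")  # 'é' is U+00E9 (code point 233)
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```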
UTF-8 is the most commonly used encoding and has several convenient properties:
- It can handle any Unicode code point.
- ASCII text is also valid UTF-8 text.
- UTF-8 is fairly compact; the most commonly used characters take only one or two bytes.
--From https://zh.wikipedia.org/wiki/Unicode#%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F
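A quick sketch demonstrating these properties (the sample characters are arbitrary):

```python
# ASCII text is valid UTF-8: both encodings produce the same bytes.
assert "abc".encode("utf-8") == "abc".encode("ascii")

# Any code point can be encoded; byte length grows with the code point.
for ch in ("a", "é", "中", "😀"):
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3 and 4 bytes respectively
```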
Python 3 support for Unicode
Since Python 3.0, strings are stored as Unicode.
The default encoding for Python source code is UTF-8; a different encoding can be declared with a comment of the form # -*- coding: <encoding name> -*-.
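A minimal sketch of what "strings are Unicode" means in practice: len() counts code points, not bytes, and encoding to bytes is an explicit step.

```python
s = "Unicode: 中文"
print(len(s))       # 11 -- counts characters (code points), not bytes

b = s.encode("utf-8")           # converting to bytes is explicit
print(type(b).__name__, len(b))  # bytes 15 -- the two CJK characters take 3 bytes each
```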
Reading and writing Unicode data
Unicode data is usually converted to a particular encoding before it is written to disk or sent over a socket. You can do all the work yourself: open a file, read 8-bit bytes from it, and convert them with bytes.decode(encoding). However, this manual approach is not recommended.
One reason is that a Unicode character can be represented by more than one byte. If you read blocks of an arbitrary size (say, 1024 or 4096 bytes), you need error-handling code to catch the case where a multi-byte character is split across the end of a block. One solution is to read the entire file into memory, but then you cannot work with very large files.
The solution is to use the low-level decoding interface to catch partial coding sequences. This work is already done for you by the open() function: open(filename, encoding=encoding) returns a file-like object with methods such as read() and write().
--The above is quoted from https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data
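A minimal sketch of letting open() handle the encoding; the filename is hypothetical:

```python
# Writing: the str is encoded to UTF-8 bytes on the way out.
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("héllo wörld\n")

# Reading: bytes are decoded back to str; no manual bytes.decode() needed,
# and multi-byte characters split across internal read blocks are handled for us.
with open("example.txt", "r", encoding="utf-8") as f:
    data = f.read()

print(data)
```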
Tips for writing Unicode-aware programs
Software should work only with Unicode strings internally: decode input data (bytes) as early as possible, and encode output only at the very end.
When using data from a browser or another untrusted source, a common technique is to check a string for illegal characters before using it on a command line or storing it in a database. If you do this, be careful to check the decoded string, not the encoded bytes, because some encodings have interesting properties, such as not being bijective or not being fully ASCII-compatible. --From https://docs.python.org/3/howto/unicode.html#tips-for-writing-unicode-aware-programs
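The "decode early, encode late" pattern can be sketched as follows; the validation rule and the helper name is_safe are illustrative only, not a real sanitization recipe:

```python
def is_safe(decoded: str) -> bool:
    # Check the *decoded string*, never the raw bytes.
    return not any(ch in decoded for ch in ";|&")

raw = "user input".encode("utf-8")  # bytes arriving from the outside world
text = raw.decode("utf-8")          # decode as soon as possible
print(is_safe(text))                # validate the str form

output = text.encode("utf-8")       # encode only at the output boundary
```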
Files in an unknown encoding
If you know that a file's encoding is reasonably ASCII-compatible and you only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler.
with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

# make changes to the string 'data'

with open(fname + '.new', 'w', encoding="ascii", errors="surrogateescape") as f:
    f.write(data)
The surrogateescape error handler decodes any non-ASCII bytes to special surrogate code points. These surrogate code points turn back into exactly the same bytes when the data is encoded with surrogateescape again and written out.
--From Https://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding
Alternatively, assuming the file uses a single encoding throughout, you can try decoding it with each of the standard encodings in turn and pick a suitable result from those that decode without error, i.e. one with no garbled characters.
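The trial-decoding idea above can be sketched as below; the candidate list, helper name guess_encoding, and sample data are all illustrative. Note that some encodings (e.g. latin-1) accept any byte sequence, so a successful decode may still be garbled and needs a human check:

```python
def guess_encoding(data: bytes, candidates=("utf-8", "gbk", "latin-1")):
    """Return the first candidate encoding that decodes without error."""
    for enc in candidates:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None

enc, text = guess_encoding("中文".encode("utf-8"))
print(enc)   # utf-8
print(text)  # 中文
```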