Python and encoding

Source: Internet
Author: User

Text objects in Python

In Python 3.x, str, bytes, and bytearray are used to process text.

bytes and bytearray support almost all str methods. The exceptions are the formatting methods (format, format_map) and several Unicode-specific methods (casefold, isdecimal, isidentifier, isnumeric, isprintable, encode).
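As a rough illustration of the point above, the sketch below calls a few str-style methods on bytes and bytearray and then checks that the formatting and Unicode-specific methods are absent. The sample values are made up for the example.

```python
# bytes and bytearray share most of the str methods...
data = b"hello world"
upper = data.upper()                  # same name and behaviour as str.upper
parts = data.split()                  # splits on ASCII whitespace
starts = data.startswith(b"he")       # arguments must be bytes, not str

buf = bytearray(b"abc")
buf_title = buf.title()               # bytearray methods return a bytearray

# ...but the formatting and Unicode-specific methods are not defined.
excluded = ("format", "format_map", "casefold", "isdecimal",
            "isidentifier", "isnumeric", "isprintable", "encode")
unexpected = [m for m in excluded
              if hasattr(bytes, m) or hasattr(bytearray, m)]

print(upper, parts, starts, buf_title, unexpected)
```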

bytes has a class method, fromhex, that builds a bytes object from a string of hexadecimal digits. str has no such method.

>>> b = bytes.fromhex('e4 b8 ad')
>>> b
b'\xe4\xb8\xad'
>>> b.decode('utf-8')
'中'
>>> str(b)
"b'\\xe4\\xb8\\xad'"

Unicode and character conversion

chr converts a Unicode code point to a character; ord performs the reverse operation.

>>> ord('A')
65
>>> ord('中')
20013
>>> chr(65)
'A'
>>> chr(20013)
'中'

The len function calculates the number of characters, not the number of bytes.

>>> len('中')
1
>>> '中'.encode('utf-8')
b'\xe4\xb8\xad'
>>> len('中'.encode('utf-8'))  # length of the bytes object, which holds three bytes (integers)
3

How Python handles encoding internally

When Python accepts input, it should always be converted to Unicode first, and the earlier this happens, the better.
After that, all processing is done on Unicode text; do not convert between encodings during this stage.
When Python returns results, the text is converted from Unicode to the encoding we need, and the later this happens, the better.
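The three steps above can be sketched as a small function. This is only an illustration: the function name shout, its parameters, and the sample input are hypothetical and do not come from the original text.

```python
def shout(raw: bytes, in_encoding: str = "utf-8",
          out_encoding: str = "utf-8") -> bytes:
    """Hypothetical helper showing decode-early, encode-late."""
    text = raw.decode(in_encoding)       # 1. decode bytes to str as early as possible
    text = text.upper() + "!"            # 2. process text only as str (Unicode)
    return text.encode(out_encoding)     # 3. encode back to bytes as late as possible


# Input arrives as Latin-1 bytes; output is requested as UTF-8 bytes.
result = shout("café".encode("latin1"), in_encoding="latin1")
print(result)
```

Keeping the middle step free of bytes means the processing code never needs to know which encodings are used at the edges.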

Python source code encoding

Python 3 uses UTF-8 as the default source code encoding.
To save Python code in a different encoding, put an encoding declaration on the first line of the file, or on the second line if the first line is occupied by the hash-bang command:
# -*- coding: windows-1252 -*-
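One way to see which encoding a declaration selects is the standard library function tokenize.detect_encoding, which reads at most the first two lines of a source file. Its use here is an addition to the original text, and the sample source bytes below are invented for the example.

```python
import codecs
import io
import tokenize

# Invented sample: a hash-bang line, an encoding declaration on line 2,
# and a string literal stored as a single cp1252 byte (0xE9 is 'é').
source = (b"#!/usr/bin/env python3\n"
          b"# -*- coding: windows-1252 -*-\n"
          b"name = 'caf\xe9'\n")

# detect_encoding takes a readline callable that returns bytes.
declared, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
codec_name = codecs.lookup(declared).name   # canonical codec name for the alias
text = source.decode(declared)              # decode the file with the declared encoding

print(declared, codec_name)
print(text)
```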

Default encodings used by Python

C:\Users\JL> chcp  # find the code page used by the operating system console
Active code page: 936
>>> import sys, locale
>>> locale.getpreferredencoding()  # this is the most important one
'cp936'
>>> my_file = open('cafe.txt', 'r')
>>> type(my_file)
<class '_io.TextIOWrapper'>
>>> my_file.encoding  # by default, file objects use locale.getpreferredencoding()
'cp936'
>>> sys.stdout.isatty(), sys.stdin.isatty(), sys.stderr.isatty()  # whether each stream is attached to the console
(True, True, True)
>>> # If the standard streams are redirected to a file, their encoding comes from the
>>> # PYTHONIOENCODING environment variable, the console encoding, or
>>> # locale.getpreferredencoding(), in decreasing order of priority.
>>> sys.stdout.encoding, sys.stdin.encoding, sys.stderr.encoding
('cp936', 'cp936', 'cp936')
>>> sys.getdefaultencoding()  # used by default when Python converts between binary data and str
'utf-8'
>>> sys.getfilesystemencoding()  # used by default to encode or decode file names (not file contents)
'mbcs'

These are the results on Windows. On GNU/Linux or OS X, all of them are UTF-8.
For the difference between mbcs and UTF-8, see http://stackoverflow.com/questions/3298569/difference-between-mbcs-and-utf-8-on-windows
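To compare machines quickly, the settings shown above can be gathered in one place. The helper name encoding_defaults is hypothetical; the calls inside it are the same ones used in the session above.

```python
import locale
import sys


def encoding_defaults() -> dict:
    """Hypothetical helper: collect the default encodings discussed above."""
    return {
        "locale.getpreferredencoding()": locale.getpreferredencoding(),
        # A standard stream can be None in some environments, hence getattr.
        "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
        "sys.stdin.encoding": getattr(sys.stdin, "encoding", None),
        "sys.stderr.encoding": getattr(sys.stderr, "encoding", None),
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        "sys.getfilesystemencoding()": sys.getfilesystemencoding(),
    }


for setting, value in encoding_defaults().items():
    print(f"{setting:32} -> {value!r}")
```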

File read/write encoding

>>> open('cafe.txt', 'w', encoding='utf-8').write('café')
4
>>> fp = open('cafe.txt', 'r')
>>> fp.read()
'caf茅'
>>> fp.encoding
'cp936'
>>> open('cafe.txt', 'r', encoding='cp936').read()
'caf茅'
>>> open('cafe.txt', 'r', encoding='latin1').read()
'cafÃ©'
>>> fp = open('cafe.txt', 'r', encoding='utf-8')
>>> fp.encoding
'utf-8'

The example above shows that you should never rely on the default encoding, because unexpected problems can occur when the code runs on different machines.
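A minimal sketch of that advice: pass encoding explicitly on both the write and the read, so the result does not depend on the machine's locale. The temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cafe.txt")

# Write with an explicit encoding; write() returns the number of characters.
with open(path, "w", encoding="utf-8") as fp:
    written = fp.write("café")

size = os.path.getsize(path)   # size on disk in bytes: 'é' takes two bytes in UTF-8

# Read back with the same explicit encoding.
with open(path, "r", encoding="utf-8") as fp:
    text = fp.read()

print(written, size, text)
```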

How does Python handle Unicode problems?

Python always compares strings by code point, both for ordering and for equality.

In Unicode, there are two ways to represent an accented character: as a single precomposed code point, or as a base letter followed by a combining accent. Unicode treats the two as canonically equivalent, but because Python compares code points, it does not consider them equal.

>>> c1 = 'cafe\u0301'
>>> c2 = 'café'
>>> c1 == c2
False
>>> len(c1), len(c2)
(5, 4)

The solution is to use the normalize function in the unicodedata library. The first argument of this function is one of four forms: "NFC", "NFD", "NFKC", or "NFKD".
NFC (Normalization Form Canonical Composition): decomposes by canonical equivalence, then recomposes by canonical equivalence. If singletons are involved, the recomposed result may differ from the original. NFC shortens the string as much as possible, so the two code points 'e\u0301' are composed into the single code point 'é'.
NFD (Normalization Form Canonical Decomposition): decomposes by canonical equivalence.
NFKD (Normalization Form Compatibility Decomposition): decomposes by compatibility equivalence.
NFKC (Normalization Form Compatibility Composition): decomposes by compatibility equivalence, then recomposes by canonical equivalence.
NFKC and NFKD may lose information.

>>> from unicodedata import normalize
>>> c3 = normalize('NFC', c1)  # compose c1 into its shorter form
>>> len(c3)
4
>>> c3 == c2
True
>>> c4 = normalize('NFD', c2)
>>> len(c4)
5
>>> c4 == c1
True

Western keyboards usually produce composed characters, so typed text is normally already in NFC form. It is still safer to normalize with NFC before comparing strings for equality. The W3C recommends NFC.

Some characters have two different code points in Unicode.
In that case, normalize converts one Unicode character into the other.

>>> from unicodedata import normalize, name
>>> o1 = '\u2126'
>>> o2 = '\u03a9'
>>> o1, o2
('Ω', 'Ω')
>>> o1 == o2
False
>>> name(o1), name(o2)
('OHM SIGN', 'GREEK CAPITAL LETTER OMEGA')
>>> o3 = normalize('NFC', o1)
>>> name(o3)
'GREEK CAPITAL LETTER OMEGA'
>>> o3 == o2
True

The compatibility forms go further. For example, NFKD replaces the micro sign with the Greek letter mu:

>>> u1 = '\u00b5'
>>> u2 = '\u03bc'
>>> u1, u2
('µ', 'μ')
>>> name(u1), name(u2)
('MICRO SIGN', 'GREEK SMALL LETTER MU')
>>> u3 = normalize('NFKD', u1)
>>> name(u3)
'GREEK SMALL LETTER MU'

Another example

>>> h1 = '\u00bd'
>>> h2 = normalize('NFKC', h1)
>>> h1, h2
('½', '1⁄2')
>>> len(h1), len(h2)
(1, 3)
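To see the four forms side by side, the sketch below normalizes one sample string with each of them and records the resulting length. The sample string is made up: it combines the decomposed 'é' and the '½' character from the examples above.

```python
from unicodedata import normalize

# 'cafe' + combining acute accent, a space, and VULGAR FRACTION ONE HALF
sample = "cafe\u0301 \u00bd"

lengths = {form: len(normalize(form, sample))
           for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, length in lengths.items():
    print(form, length, ascii(normalize(form, sample)))
```

The canonical forms leave '½' untouched and differ only in how the accent is stored; the compatibility forms also expand '½' into three code points.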

Sometimes we want to make a case-insensitive comparison.
Use str.casefold(), which converts uppercase letters to lowercase for comparison. For example, 'A' is converted to 'a', and the 'µ' of MICRO SIGN is converted to the 'μ' of GREEK SMALL LETTER MU.
In most cases (98.9%), str.casefold() and str.lower() give the same result.
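A short sketch of where the two methods agree and where they differ. The German 'ß' is an extra example not mentioned in the text above.

```python
# Map each sample character to (lower(), casefold()).
samples = ("A", "\u00b5", "\u00df")   # 'A', MICRO SIGN, LATIN SMALL LETTER SHARP S
folded = {ch: (ch.lower(), ch.casefold()) for ch in samples}

for ch, (lowered, casefolded) in folded.items():
    print(ascii(ch), ascii(lowered), ascii(casefolded))

# casefold() makes caseless comparison work where lower() does not.
same_with_lower = "Straße".lower() == "STRASSE".lower()
same_with_casefold = "Straße".casefold() == "STRASSE".casefold()
print(same_with_lower, same_with_casefold)
```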

Text sorting
Because sorting rules differ between languages, simply comparing code points, as Python does, often gives unexpected results.
Generally, locale.strxfrm is used as the sort key.

>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
'pt_BR.UTF-8'
>>> sort_result = sorted(initial, key=locale.strxfrm)  # initial is the list of strings to sort
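The requested locale has to be installed on the machine, otherwise setlocale raises locale.Error. The sketch below is one possible way to handle that, assuming a plain code point sort is an acceptable fallback; the function name sort_with_locale and the word list are hypothetical.

```python
import locale

fruits = ["caju", "atemoia", "cajá", "açaí", "acerola"]


def sort_with_locale(words, locale_name="pt_BR.UTF-8"):
    """Hypothetical helper: sort with locale rules, or by code point if unavailable.

    Returns (sorted_words, locale_was_used).
    """
    previous = locale.setlocale(locale.LC_COLLATE)   # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Locale not installed on this machine: fall back to code point order.
        return sorted(words), False
    try:
        return sorted(words, key=locale.strxfrm), True
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting


result, used_locale = sort_with_locale(fruits)
print(result, used_locale)
```

Restoring the previous setting matters because setlocale changes global state for the whole process.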

Encoding and decoding errors

If a decoding error occurs while Python loads source code, a SyntaxError is raised.
In other cases, an encoding or decoding error raises UnicodeEncodeError or UnicodeDecodeError.
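The sketch below triggers both exceptions and then shows the errors argument of encode and decode, which is not covered in the text above but is the usual way to avoid them when losing data is acceptable.

```python
# 'é' cannot be represented in ASCII, so strict encoding fails.
encode_failed = False
try:
    "café".encode("ascii")
except UnicodeEncodeError:
    encode_failed = True

# 0xE9 is 'é' in Latin-1 but an incomplete sequence in UTF-8.
decode_failed = False
try:
    b"caf\xe9".decode("utf-8")
except UnicodeDecodeError:
    decode_failed = True

# Lenient error handlers trade correctness for not raising.
replaced = "café".encode("ascii", errors="replace")       # unencodable char becomes '?'
ignored = "café".encode("ascii", errors="ignore")         # unencodable char is dropped
lenient = b"caf\xe9".decode("utf-8", errors="replace")    # bad byte becomes U+FFFD

print(encode_failed, decode_failed, replaced, ignored, ascii(lenient))
```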

Several useful functions from Fluent Python

import string
from unicodedata import normalize, combining

def nfc_equal(s1, s2):
    '''Return True if string s1 is equal to string s2 after normalization under "NFC"'''
    return normalize('NFC', s1) == normalize('NFC', s2)

def fold_equal(s1, s2):
    '''Return True if string s1 is equal to string s2 after normalization under "NFC" and casefold()'''
    return normalize('NFC', s1).casefold() == normalize('NFC', s2).casefold()

def shave_marks(txt):
    '''Remove all diacritic marks.
    This is mainly meant to turn Latin text into plain ASCII, but it also changes Greek letters.
    The shave_latin_marks function below is more precise.'''
    normal_txt = normalize('NFD', txt)  # split base characters from combining marks
    shaved = ''.join(c for c in normal_txt if not combining(c))  # drop the marks
    return normalize('NFC', shaved)

def shave_latin_marks(txt):
    '''Remove all diacritic marks from Latin base characters'''
    normal_txt = normalize('NFD', txt)
    keeping = []
    latin_base = False
    for c in normal_txt:
        if combining(c) and latin_base:
            continue  # ignore diacritic marks on a Latin base char
        keeping.append(c)
        # if it's not a combining char, it should be a new base char
        if not combining(c):
            latin_base = c in string.ascii_letters
    shaved = ''.join(keeping)
    return normalize('NFC', shaved)

Encoding sniffing with Chardet

Chardet is a third-party Python package, not part of the standard library. It guesses the encoding of a byte sequence.
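A minimal sketch of encoding sniffing, assuming the third-party chardet package has been installed (for example with pip). The wrapper name sniff_encoding and the sample text are hypothetical; chardet.detect is the package's own function and returns a dictionary that includes the guessed encoding and a confidence value. The guess is a heuristic and can be wrong, especially for short inputs.

```python
def sniff_encoding(data: bytes):
    """Hypothetical wrapper: return chardet's guess, or None if chardet is missing."""
    try:
        import chardet  # third-party package, assumed to be installed
    except ImportError:
        return None
    return chardet.detect(data)


sample = "这是一个用来检测编码的中文句子。".encode("utf-8")
guess = sniff_encoding(sample)
print(guess)
```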

