Python and encoding

Source: Internet
Author: User
Tags: locale, ord, stdin

Text objects in Python

The types that represent text and binary data in Python 3.x are str, bytes, and bytearray.

bytes and bytearray support nearly all of str's methods, except the formatting methods (format, format_map), several Unicode-specific methods (casefold, isdecimal, isidentifier, isnumeric, isprintable), and encode.
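This difference is easy to verify with hasattr; a minimal sketch (not from the original text):

```python
# str-style methods such as upper() exist on bytes...
data = "中".encode("utf-8")
print(data.upper())

# ...but the formatting and Unicode-specific methods do not.
print(hasattr(bytes, "format"))     # formatting is str-only
print(hasattr(bytes, "casefold"))   # Unicode-specific, str-only
print(hasattr(bytes, "isdecimal"))  # Unicode-specific, str-only
```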

bytes has a class method, fromhex, that builds a bytes object from a hex string; str has no such method.

>>> b = bytes.fromhex('E4 B8 AD')
>>> b
b'\xe4\xb8\xad'
>>> b.decode('utf-8')
'中'
>>> str(b)
"b'\\xe4\\xb8\\xad'"

Unicode and character conversions

With chr, you can convert a Unicode code point to a character; ord does the reverse.

>>> ord('A')
65
>>> ord('中')
20013
>>> chr(65)
'A'
>>> chr(20013)
'中'

The len function counts characters, not bytes:

>>> len('中')
1
>>> '中'.encode('utf-8')
b'\xe4\xb8\xad'
>>> len('中'.encode('utf-8'))  # the length of the bytes object: 3 bytes
3

Python and encoding

How Python internally processes the encoding

When Python receives input, it should be converted to Unicode as early as possible.
All processing is then done on Unicode str objects, and no encoding conversion should happen during that stage.
When Python returns results, they are converted from Unicode to the required encoding as late as possible.
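This pattern is often called the "Unicode sandwich". A minimal sketch (the variable names are illustrative, not from the original text):

```python
# bytes at the boundaries, str in the middle
raw = "café".encode("utf-8")    # bytes arriving from the outside world
text = raw.decode("utf-8")      # decode as early as possible
processed = text.upper()        # all processing happens on str
out = processed.encode("utf-8") # encode as late as possible
print(processed)
```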

How to encode Python source code

Python 3 source files use UTF-8 encoding by default.
To save source code in a different encoding, place an encoding declaration on the first line of the file, or on the second line if the first is taken by the hash-bang (shebang) line:
# -*- coding: windows-1252 -*-

The encoding used in Python

C:\users\jl>chcp  # query the code page the OS console uses
Active code page: 936
>>> import sys, locale
>>> locale.getpreferredencoding()  # the most important setting
'cp936'
>>> my_file = open('cafe.txt', 'r')
>>> type(my_file)
<class '_io.TextIOWrapper'>
>>> my_file.encoding  # file objects default to locale.getpreferredencoding()
'cp936'
>>> sys.stdout.isatty(), sys.stdin.isatty(), sys.stderr.isatty()  # are the streams attached to a console?
(True, True, True)
>>> sys.stdout.encoding, sys.stdin.encoding, sys.stderr.encoding  # if a stream is redirected to a file, its encoding falls back, in decreasing priority, to the PYTHONIOENCODING environment variable, the console's encoding, or locale.getpreferredencoding()
('cp936', 'cp936', 'cp936')
>>> sys.getdefaultencoding()  # used by default when Python converts binary data to str
'utf-8'
>>> sys.getfilesystemencoding()  # used to encode and decode file names (not file contents)
'mbcs'

The above are the results on Windows; on GNU/Linux or OS X all of these results are UTF-8.
For the difference between MBCS and UTF-8, see http://stackoverflow.com/questions/3298569/difference-between-mbcs-and-utf-8-on-windows

Encoding of file read and write

>>> open('cafe.txt', 'w', encoding='utf-8').write('café')
4
>>> fp = open('cafe.txt', 'r')
>>> fp.read()
'caf锚'
>>> fp.encoding
'cp936'
>>> open('cafe.txt', 'r', encoding='cp936').read()
'caf锚'
>>> open('cafe.txt', 'r', encoding='latin1').read()
'cafÃ©'
>>> fp = open('cafe.txt', 'r', encoding='utf-8')
>>> fp.encoding
'utf-8'

As the example above shows, you should never rely on the default encoding: the same code can produce unexpected results when run on different machines.

How Python handles the trouble from Unicode

Python always compares strings, including for equality, by code point.

Unicode has two representations for accented characters: a single precomposed code point, or a base letter followed by a combining accent. The two forms are canonically equivalent in Unicode, but because Python compares by code point, they are not equal:

>>> c1 = 'cafe\u0301'
>>> c2 = 'café'
>>> c1 == c2
False
>>> len(c1), len(c2)
(5, 4)

The fix is the normalize function in the unicodedata module; its first argument is one of the four forms 'NFC', 'NFD', 'NFKC', 'NFKD'.
NFC (Normalization Form Canonical Composition): decomposes by canonical equivalence, then recomposes by canonical equivalence. This yields the shortest equivalent string, so the two code points 'e\u0301' are composed into the single code point 'é'. (For singletons, the recomposed result can differ from the original character.)
NFD (Normalization Form Canonical Decomposition): decomposes by canonical equivalence.
NFKD (Normalization Form Compatibility Decomposition): decomposes by compatibility equivalence.
NFKC (Normalization Form Compatibility Composition): decomposes by compatibility equivalence, then recomposes by canonical equivalence.
NFKC and NFKD may lose information.

>>> from unicodedata import normalize
>>> c3 = normalize('NFC', c1)  # compose c1 into the shorter form
>>> len(c3)
4
>>> c3 == c2
True
>>> c4 = normalize('NFD', c2)
>>> len(c4)
5
>>> c4 == c1
True

Western keyboards usually produce the shortest form, i.e. text consistent with NFC, but it is safer to normalize with NFC before comparing strings for equality. Using NFC is recommended.

Some visually identical characters have two distinct code points in Unicode.
NFC normalization converts one such code point into the other:

>>> from unicodedata import name, normalize
>>> o1 = '\u2126'
>>> o2 = '\u03a9'
>>> o1, o2
('Ω', 'Ω')
>>> o1 == o2
False
>>> name(o1), name(o2)
('OHM SIGN', 'GREEK CAPITAL LETTER OMEGA')
>>> o3 = normalize('NFC', o1)
>>> name(o3)
'GREEK CAPITAL LETTER OMEGA'
>>> o3 == o2
True

Another example

>>> u1 = '\u00b5'
>>> u2 = '\u03bc'
>>> u1, u2
('µ', 'μ')
>>> name(u1), name(u2)
('MICRO SIGN', 'GREEK SMALL LETTER MU')
>>> u3 = normalize('NFKD', u1)
>>> name(u3)
'GREEK SMALL LETTER MU'

One more example

>>> h1 = '\u00bd'
>>> h2 = normalize('NFKC', h1)
>>> h1, h2
('½', '1⁄2')
>>> len(h1), len(h2)
(1, 3)

Sometimes we want to compare strings in a case-insensitive way.
The method str.casefold() converts characters to a case-folded form for comparison: 'A' becomes 'a', and MICRO SIGN 'µ' becomes GREEK SMALL LETTER MU 'μ'.
In the vast majority of cases (about 98.9%) str.casefold() and str.lower() give the same result.
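The differences show up with characters like the micro sign and the German sharp s; a small sketch (the examples are illustrative, not from the original text):

```python
print('A'.casefold())        # plain ASCII folds like lower()
print('\u00b5'.casefold())   # MICRO SIGN folds to GREEK SMALL LETTER MU
print('\u00df'.lower())      # lower() leaves 'ß' unchanged
print('\u00df'.casefold())   # casefold() is more aggressive: 'ß' -> 'ss'
```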

Sort text
Because collation rules differ between languages, sorting strings by Python's code-point comparison often produces results users do not expect.
The usual fix is to sort with locale.strxfrm as the key function.

>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
'pt_BR.UTF-8'
>>> sorted_result = sorted(initial, key=locale.strxfrm)

Encoding decoding error

A decoding error in Python source code raises a SyntaxError exception.
In other cases, a codec error raises a UnicodeEncodeError or UnicodeDecodeError exception.
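A small sketch of both error cases, and of the errors= parameter of encode() that replaces or ignores offending characters instead of raising (the examples are illustrative, not from the original text):

```python
# UnicodeEncodeError: the target codec cannot represent the character.
try:
    '中'.encode('ascii')
except UnicodeEncodeError as e:
    print('encode failed:', e.reason)

# UnicodeDecodeError: the bytes are not valid in the chosen codec.
try:
    b'\xe4\xb8\xad'.decode('ascii')
except UnicodeDecodeError as e:
    print('decode failed:', e.reason)

# errors='replace' substitutes '?' instead of raising.
print('中'.encode('ascii', errors='replace'))
```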

Several useful helpers adapted from Fluent Python

import string
from unicodedata import normalize, combining

def nfc_equal(s1, s2):
    """Return True if s1 equals s2 after NFC normalization."""
    return normalize('NFC', s1) == normalize('NFC', s2)

def fold_equal(s1, s2):
    """Return True if s1 equals s2 after NFC normalization and casefold()."""
    return normalize('NFC', s1).casefold() == normalize('NFC', s2).casefold()

def shave_marks(txt):
    """Remove all diacritic marks.
    Meant to turn Latin text into pure ASCII, but it also strips marks
    from Greek and other scripts; shave_latin_marks below is more precise."""
    norm_txt = normalize('NFD', txt)
    shaved = ''.join(c for c in norm_txt if not combining(c))
    return normalize('NFC', shaved)

def shave_latin_marks(txt):
    """Remove diacritic marks only from Latin base characters."""
    norm_txt = normalize('NFD', txt)
    keeping = []
    latin_base = False
    for c in norm_txt:
        if combining(c) and latin_base:
            continue  # ignore diacritics on a Latin base character
        keeping.append(c)
        # any non-combining character becomes the new base character
        if not combining(c):
            latin_base = c in string.ascii_letters
    return normalize('NFC', ''.join(keeping))

Encoding detection with Chardet

Chardet is a third-party library (not part of the standard library) that guesses the encoding of a byte sequence.

Resources:

http://blog.csdn.net/tcdddd/article/details/8191464

