Python Advanced (IV): Text and Byte Sequences (Encoding Problems)
Main content of this article:
Characters
Bytes
Structs and memory views
Converting between characters and bytes: codecs
BOM gremlin characters
To be continued...
The code in this article is on github: https://github.com/ampeeg/cnblogs/tree/master/pythonadvanced
Character
'''
Character encoding is a problem that often plagues Python programmers; I frequently run into it when writing crawlers.
Since Python 3, human language (text strings) and machine language (binary bytes) are clearly distinguished.
Before talking about text strings, we must define "character":
    Character: a Unicode character. The elements obtained from a str object in Python 3 are Unicode characters.
    String: a sequence of characters (which echoes the content of part (I) of this series).
'''

if __name__ == "__main__":
    # create characters
    s1 = str('a')
    s2 = 'b'
    s3 = u'c'
    print(s1, s2, s3)    # a b c
Remember: in Python 3 every character is a Unicode character. In other words, str is Unicode text, the form that humans can read.
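A minimal sketch (not from the original article) of what "str is a sequence of Unicode characters" means in practice: the length of a str counts characters, not bytes, and each element maps to one code point. The sample string is chosen only for illustration.

```python
# Each element of a str is one Unicode character (code point),
# regardless of how many bytes it needs once encoded.
s = 'a你好'

chars = list(s)                        # iterate the str character by character
code_points = [ord(ch) for ch in s]    # Unicode code point of each character
utf8_len = len(s.encode('utf8'))       # number of bytes after UTF-8 encoding

print(len(s), chars)    # 3 ['a', '你', '好']
print(code_points)      # [97, 20320, 22909]
print(utf8_len)         # 7 (1 byte for 'a', 3 bytes for each Chinese character)
```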
Bytes
'''
Python 3 has two built-in binary sequence types: the immutable bytes and the mutable bytearray.
(1) Each element of a bytes or bytearray object is an integer between 0 and 255 (8 bits).
(2) A slice of a binary sequence is always a binary sequence of the same type.
'''

if __name__ == "__main__":
    # create bytes and bytearray
    b1 = bytes('abc你好', encoding='utf8')    # about encoding: I wonder whether anyone else, like me, keeps mixing up the directions of encoding and decoding
    print(b1)           # b'abc\xe4\xbd\xa0\xe5\xa5\xbd'
    b2 = bytearray('abc你好', encoding='utf8')
    print(b2)           # bytearray(b'abc\xe4\xbd\xa0\xe5\xa5\xbd')

    # slicing (tip: every sequence can be sliced)
    print(b1[3:5])      # b'\xe4\xbd'
    print(b2[3:5])      # bytearray(b'\xe4\xbd')

    # index it the way you would index a list
    print(b1[3])        # 228    this is not a byte sequence but a single element (an int)
    for _ in b1:
        print(_, end=',')    # 97,98,99,228,189,160,229,165,189,    all of them are 8-bit integers

    # bytes is immutable, bytearray is mutable
    # b1[3] = 160       # error: 'bytes' object does not support item assignment
    print(id(b2), b2)   # 4373768376 bytearray(b'abc\xe4\xbd\xa0\xe5\xa5\xbd')
    b2[2] = 78
    print(id(b2), b2)   # 4373768376 bytearray(b'abN\xe4\xbd\xa0\xe5\xa5\xbd')

    # convert b2 back to a string to take a look
    print(b2.decode('utf8'))    # abN你好    note: this decodes cleanly because 78 is the ASCII code of 'N', and UTF-8 encodes ASCII characters as the same single byte
    b2.extend(bytearray('添加的内容', encoding='utf8'))    # as a mutable sequence, bytearray naturally has the usual sequence methods
    print(id(b2), b2)   # 4373768376 bytearray(b'abN\xe4\xbd\xa0\xe5\xa5\xbd\xe6\xb7\xbb\xe5\x8a\xa0\xe7\x9a\x84\xe5\x86\x85\xe5\xae\xb9')
    print(b2.decode('utf8'))    # abN你好添加的内容

    # PS: you can think of a binary sequence as a list whose elements are byte values (0 to 255)
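Since the direction of encoding and decoding is easy to mix up, here is a small round-trip sketch (an illustration added for this point, not part of the article's repository). One way to remember it: str is what humans read, bytes is what machines store, so you encode text into bytes and decode bytes back into text.

```python
text = 'abc你好'

data = text.encode('utf8')        # str -> bytes (human text to machine bytes)
restored = data.decode('utf8')    # bytes -> str (machine bytes back to human text)

print(type(data).__name__, data)            # bytes b'abc\xe4\xbd\xa0\xe5\xa5\xbd'
print(type(restored).__name__, restored)    # str abc你好

# In Python 3 the wrong direction does not even exist as a method:
# bytes has no encode() and str has no decode().
print(hasattr(data, 'encode'), hasattr(text, 'decode'))    # False False
```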
Structs and memory views
'''
The struct module can extract structured information from binary sequences.
It provides functions that convert packed byte sequences into tuples of fields of different types, and functions that perform the reverse conversion.
The struct module can work with bytes, bytearray, and memoryview objects.
'''

import struct

if __name__ == "__main__":
    # The memoryview class provides shared-memory access: it lets you work with slices of other
    # binary sequences, packed arrays, and buffers without copying the bytes.
    fmt = '<3s3sHH'    # format: < means little-endian, 3s3s is two 3-byte sequences, HH is two 16-bit integers
    with open('l3_tu_python.jpg', 'rb') as f:
        img = memoryview(f.read())

    print(bytes(img[:10]))    # b'\xff\xd8\xff\xe0\x00\x10JFIF'

    print(struct.unpack(fmt, img[:10]))    # (b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)    unpack

    del img
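The example above needs an image file on disk. The following sketch shows the same idea without a file, under the assumption that the 10 header bytes are the ones printed in the article's output; it also packs the fields back to show the reverse conversion. The variable names are illustrative only.

```python
import struct

fmt = '<3s3sHH'    # little-endian: two 3-byte strings, then two unsigned 16-bit integers

# The first 10 bytes from the article's output: ff d8 ff e0 00 10 'J' 'F' 'I' 'F'
header = bytes.fromhex('ffd8ffe000104a464946')
view = memoryview(header)

print(struct.calcsize(fmt))          # 10, the number of bytes this format consumes

fields = struct.unpack(fmt, view)    # bytes -> tuple of fields
print(fields)                        # (b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)

repacked = struct.pack(fmt, *fields) # tuple of fields -> bytes (the reverse conversion)
print(repacked == header)            # True

# Slicing a memoryview creates another view on the same object instead of a copy.
part = view[6:10]
print(bytes(part), part.obj is header)    # b'JFIF' True
```

The two integers are simply the byte pairs "JF" and "IF" read as little-endian 16-bit numbers (0x464A and 0x4649).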
Converting between characters and bytes: codecs
'''
Python ships with more than 100 codecs (encoders/decoders) for converting between text and bytes.
Each codec has several names, such as 'utf_8', 'utf8', 'utf-8', and 'U8', which can be passed as the
encoding argument to open(), str.encode(), bytes.decode(), and so on.
'''

if __name__ == "__main__":
    # compare the results of different encodings
    for codec in ['gbk', 'utf8', 'utf16']:
        print(codec, "你好".encode(codec), sep='\t')
    # gbk      b'\xc4\xe3\xba\xc3'
    # utf8     b'\xe4\xbd\xa0\xe5\xa5\xbd'
    # utf16    b'\xff\xfe`O}Y'

    # decode
    print(b'\xc4\xe3\xba\xc3'.decode('gbk'))            # 你好
    print(b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf8'))   # 你好
    print(b'\xff\xfe`O}Y'.decode('utf16'))              # 你好
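To check that several names really refer to the same codec, the standard codecs.lookup() function can be used. This is a small sketch added for illustration; the alias list is the one quoted in the docstring above.

```python
import codecs

aliases = ['utf_8', 'utf8', 'utf-8', 'U8']

# codecs.lookup() resolves an alias and returns a CodecInfo object
# whose .name attribute is the codec's canonical name.
names = {alias: codecs.lookup(alias).name for alias in aliases}
print(names)

# Every alias produces exactly the same bytes.
encoded = {'你好'.encode(alias) for alias in aliases}
print(encoded)    # {b'\xe4\xbd\xa0\xe5\xa5\xbd'}
```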
'''
Encoding problems are usually very annoying. Let's look at how to deal with them:
(1) UnicodeEncodeError
(2) UnicodeDecodeError
'''

if __name__ == "__main__":
    # (1) UnicodeEncodeError
    # use the errors parameter
    s1 = "hello，你长胖啦".encode('latin-1', errors='ignore')
    print(s1)    # b'hello'    errors='ignore' silently drops the characters that cannot be encoded

    s2 = "hello，你长胖啦".encode('latin-1', errors='replace')
    print(s2)    # b'hello?????'    errors='replace' substitutes '?' for each character that cannot be encoded

    s3 = "hello，你长胖啦".encode('latin-1', errors='xmlcharrefreplace')
    print(s3)    # b'hello&#65292;&#20320;&#38271;&#32982;&#21862;'    errors='xmlcharrefreplace' substitutes an XML entity for each character that cannot be encoded

    # (2) UnicodeDecodeError
    # Garbled characters are known as gremlins (mojibake); the examples below show how they arise
    s4 = b'Montr\xe9al'
    print(s4.decode('cp1252'))       # Montréal
    print(s4.decode('iso8859_7'))    # Montrιal
    print(s4.decode('koi8_r'))       # MontrИal
    # print(s4.decode('utf8'))       # error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
    print(s4.decode('utf8', errors='replace'))    # Montr�al
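One common way to cope with UnicodeDecodeError is to try a few candidate encodings in order. The sketch below is only an illustration: decode_with_fallback is a hypothetical helper, not a standard library function, and the candidate list is an assumption you would adapt to your own data. Keep in mind that decoding without an error does not prove the encoding is right, as the gremlin examples above show.

```python
def decode_with_fallback(data, encodings=('utf8', 'cp1252')):
    """Hypothetical helper: return (text, encoding) for the first codec
    in `encodings` that decodes `data` without raising an error."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue    # this candidate failed, try the next one
    # Nothing decoded cleanly: fall back to the first codec and mark
    # the undecodable bytes with U+FFFD instead of raising.
    return data.decode(encodings[0], errors='replace'), None


print(decode_with_fallback('你好'.encode('utf8')))    # ('你好', 'utf8')
print(decode_with_fallback(b'Montr\xe9al'))           # ('Montréal', 'cp1252')
print(decode_with_fallback(b'\x81'))                  # ('\ufffd', None)
```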
To be continued...