Python advanced (IV): text and byte sequences (encoding problems)

Last Update:2018-02-28 Source: Internet

Author: User


Main content of this article

Character

Bytes

Structs and memory views

Conversion between characters and bytes: codecs

BOM ghost character

Continue tomorrow...

Python advanced-directory

The code in this article is on github: https://github.com/ampeeg/cnblogs/tree/master/pythonadvanced

Character

'''
Character encoding is a problem that often plagues Python programmers; I run into this headache regularly when writing crawlers.
Starting with Python 3, human language (text strings) and machine language (binary bytes) are clearly distinguished.
Before discussing text strings, we must define "character":
    Character: a Unicode character. The elements obtained from a str object in Python 3 are Unicode characters.
    String: a sequence of characters (this echoes the content of part (1)).
'''

if __name__ == "__main__":
    # Create characters
    s1 = str('a')
    s2 = 'b'
    s3 = u'c'
    print(s1, s2, s3)    # a b c

Remember: in Python 3 a character is a Unicode character, that is, str is Unicode text, the form that humans can read.
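To make the distinction concrete, here is a minimal sketch (the variable names are illustrative and not taken from the original article) showing that a str is a sequence of Unicode characters, and that the character count differs from the byte count once the text is encoded.

```python
# Illustrative sketch; names such as `text` and `code_points` are assumptions.
text = "abc你好"

# Iterating over a str yields Unicode characters, one per code point.
chars = list(text)
code_points = [ord(c) for c in text]

# Encoding converts the characters to bytes; each of the two Chinese
# characters needs three bytes in UTF-8.
encoded = text.encode("utf8")

print(chars)          # ['a', 'b', 'c', '你', '好']
print(code_points)    # [97, 98, 99, 20320, 22909]
print(len(text))      # 5 characters
print(len(encoded))   # 9 bytes
```

The length of a str counts characters, not bytes, which is why the two len() calls disagree.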

Bytes

'''
Python 3 has two built-in binary sequence types: the immutable bytes and the mutable bytearray.
    (1) Each element of a bytes or bytearray object is an integer from 0 to 255 (8 bits).
    (2) A slice of a binary sequence is always a binary sequence of the same type.
'''

if __name__ == "__main__":
    # Create bytes and bytearray objects ('你好' means "hello")
    b1 = bytes('abc你好', encoding='utf8')   # About encode: I wonder whether anyone else, like me, keeps mixing up the directions of encoding and decoding
    print(b1)            # b'abc\xe4\xbd\xa0\xe5\xa5\xbd'

    b2 = bytearray('abc你好', encoding='utf8')
    print(b2)            # bytearray(b'abc\xe4\xbd\xa0\xe5\xa5\xbd')

    # Slicing (tip: all sequences can be sliced)
    print(b1[3:5])       # b'\xe4\xbd'
    print(b2[3:5])       # bytearray(b'\xe4\xbd')

    # Try indexing, as with a list
    print(b1[3])         # 228  the result is not a byte sequence but a single element
    for _ in b1:
        print(_, end=',')    # 97,98,99,228,189,160,229,165,189,  every element is an 8-bit integer

    # bytes is immutable; bytearray is mutable
    # b1[3] = 160        # TypeError: 'bytes' object does not support item assignment

    print(id(b2), b2)    # 4373768376 bytearray(b'abc\xe4\xbd\xa0\xe5\xa5\xbd')
    b2[2] = 78
    print(id(b2), b2)    # 4373768376 bytearray(b'abN\xe4\xbd\xa0\xe5\xa5\xbd')

    # Convert b2 back to a string to take a look
    print(b2.decode('utf8'))    # abN你好
    # Note: decoding as UTF-8 still works because the code of N (78) is the same in ASCII and UTF-8

    b2.extend(bytearray('添加的内容', encoding='utf8'))    # '添加的内容' means "added content"; being a mutable sequence, bytearray has the usual sequence methods
    print(id(b2), b2)    # 4373768376 bytearray(b'abN\xe4\xbd\xa0\xe5\xa5\xbd\xe6\xb7\xbb\xe5\x8a\xa0\xe7\x9a\x84\xe5\x86\x85\xe5\xae\xb9')
    print(b2.decode('utf8'))    # abN你好添加的内容

    # PS: you can think of a binary sequence as a list whose elements are integers (0 to 255)
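The two rules in the docstring above (elements are integers from 0 to 255; slices keep the sequence type) can be checked in isolation. The following is a small sketch with made-up values, not code from the original article.

```python
# Illustrative sketch; the values and names are assumptions.
data = bytes([72, 105])        # build bytes from integers in range(256)

first_item = data[0]           # indexing gives an int
first_slice = data[:1]         # slicing gives a bytes object of length 1

buf = bytearray(data)          # mutable copy of the same bytes
buf[0] = ord("h")              # item assignment works on bytearray
buf.extend(b"!")

# Values outside 0..255 are rejected when building a binary sequence.
try:
    bytes([256])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(first_item)              # 72
print(first_slice)             # b'H'
print(buf)                     # bytearray(b'hi!')
print(type(buf[:1]).__name__)  # bytearray
print(out_of_range_rejected)   # True
```

Note the asymmetry: data[0] is an int, while data[:1] is a bytes object of length one.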

Structs and memory views

'''
The struct module extracts structured information from binary sequences.
It provides functions that convert packed byte sequences into tuples of fields of different types, and other functions that perform the reverse conversion.
The struct module can process bytes, bytearray, and memoryview objects.
'''

import struct

if __name__ == "__main__":
    # The memoryview class provides shared-memory access to other binary sequences,
    # packed arrays, and buffers, so data can be sliced without copying the byte sequence
    fmt = '<3s3sHH'    # format: < means little-endian, 3s3s are two 3-byte sequences, HH are two 16-bit integers
    with open('l3_tu_python.jpg', 'rb') as f:
        img = memoryview(f.read())

    print(bytes(img[:20]))    # b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x02\x00\x1c\x00\x1c\x00\x00'

    print(struct.unpack(fmt, img[:10]))    # (b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)  unpacked fields

    del img    # drop the reference so the memoryview can be released
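The example above depends on a local image file. As a self-contained sketch, the same format string can be exercised by packing the header fields first and then unpacking them through a memoryview. The field values are copied from the output shown above; everything else is illustrative.

```python
import struct

# Same format as above: little-endian, two 3-byte strings, two unsigned 16-bit ints.
fmt = "<3s3sHH"

# Pack the four fields into 10 bytes (values copied from the article's output).
packed = struct.pack(fmt, b"\xff\xd8\xff", b"\xe0\x00\x10", 17994, 17993)

# A memoryview gives access to the bytes without copying them.
view = memoryview(packed)
fields = struct.unpack(fmt, view)
size = struct.calcsize(fmt)

print(packed)    # b'\xff\xd8\xff\xe0\x00\x10JFIF'
print(fields)    # (b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)
print(size)      # 10

view.release()   # explicitly release the underlying buffer
```

The integers 17994 and 17993 are simply the byte pairs b'JF' and b'IF' read as little-endian 16-bit values.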

Conversion between characters and bytes: codecs

'''
Python ships with more than 100 codecs (encoder/decoder pairs) for converting between strings and bytes.
Each codec has several names, such as 'utf_8', 'utf8', 'utf-8', and 'U8',
which can be passed as the encoding argument to open(), str.encode(), bytes.decode(), and similar functions.
'''

if __name__ == "__main__":
    # Look at the effect of each encoding ('你好' means "hello")
    for codec in ['gbk', 'utf8', 'utf16']:
        print(codec, "你好".encode(codec), sep='\t')
    '''
    gbk     b'\xc4\xe3\xba\xc3'
    utf8    b'\xe4\xbd\xa0\xe5\xa5\xbd'
    utf16   b'\xff\xfe`O}Y'
    '''

    # Decode
    print(b'\xc4\xe3\xba\xc3'.decode('gbk'))              # 你好
    print(b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf8'))     # 你好
    print(b'\xff\xfe`O}Y'.decode('utf16'))                # 你好
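To check that the different spellings listed above really refer to one codec, the standard codecs.lookup() function can be used. This is a minimal sketch; the names aliases and canonical are illustrative.

```python
import codecs

# The spellings mentioned in the text above.
aliases = ["utf_8", "utf8", "utf-8", "U8"]

# codecs.lookup() returns a CodecInfo object whose .name is the canonical name.
canonical = {alias: codecs.lookup(alias).name for alias in aliases}

for alias, name in canonical.items():
    print(alias, "->", name)

# Any alias can be used interchangeably for encoding and decoding.
round_trip = "你好".encode("U8").decode("utf_8")
print(round_trip)
```

An unknown codec name makes codecs.lookup() raise LookupError, which is a cheap way to validate an encoding name that comes from user input.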

'''
Encoding problems are generally very annoying. Let's look at how to deal with them:
    (1) UnicodeEncodeError
    (2) UnicodeDecodeError
'''

if __name__ == "__main__":
    # (1) UnicodeEncodeError
    # Use the errors parameter ('hello，你长胖啦' means "hello, you've put on weight")
    s1 = "hello，你长胖啦".encode('latin-1', errors='ignore')
    print(s1)    # b'hello'  errors='ignore' skips characters that cannot be encoded

    s2 = "hello，你长胖啦".encode('latin-1', errors='replace')
    print(s2)    # b'hello?????'  errors='replace' substitutes '?' for characters that cannot be encoded

    s3 = "hello，你长胖啦".encode('latin-1', errors='xmlcharrefreplace')
    print(s3)    # b'hello&#65292;&#20320;&#38271;&#32982;&#21862;'  errors='xmlcharrefreplace' substitutes XML character references

    # (2) UnicodeDecodeError
    # Garbled characters are known as ghost characters (gremlins); the example below shows how they arise
    s4 = b'Montr\xe9al'
    print(s4.decode('cp1252'))       # Montréal
    print(s4.decode('iso8859_7'))    # Montrιal
    print(s4.decode('koi8_r'))       # MontrИal
    # print(s4.decode('utf8'))       # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
    print(s4.decode('utf8', errors='replace'))    # Montr�al
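One common way to cope with UnicodeDecodeError when the source encoding is unknown is to try a short list of candidate encodings in order. The helper below is a hypothetical sketch (decode_with_fallback and its default candidate list are assumptions, not part of the original article or of any library).

```python
def decode_with_fallback(raw, encodings=("utf8", "cp1252")):
    """Try each candidate encoding in turn (hypothetical helper).

    Returns (text, encoding_used). If every candidate fails, decodes with
    the first encoding and errors='replace', and returns None as the encoding.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes; try the next
    return raw.decode(encodings[0], errors="replace"), None


print(decode_with_fallback("你好".encode("utf8")))   # ('你好', 'utf8')
print(decode_with_fallback(b"Montr\xe9al"))          # ('Montréal', 'cp1252')
print(decode_with_fallback(b"\x81"))                 # ('\ufffd', None)
```

This is only a heuristic. As the Montr\xe9al example shows, several single-byte encodings decode the same bytes without raising an error, so a successful decode does not prove that the guessed encoding is the right one.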

Continue tomorrow...


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
