Python uses struct to process binary (pack and unpack usage)

Last Update:2017-09-25 Source: Internet

Author: User

Reprinted from: http://www.cnblogs.com/gala/archive/2011/09/22/2184801.html

This article is very well written, so I am shamelessly reposting it.

Sometimes you need to process binary data in Python, for example when reading and writing files or working with sockets. For this you can use Python's struct module, which handles data laid out like structs in C.

The three most important functions in the struct module are pack(), unpack(), and calcsize():

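    # Number Fourth programmer http://www.coder4.com
    # Pack the values according to the given format (fmt) into a string
    # (in practice a byte stream laid out like a C struct)
    pack(fmt, v1, v2, ...)
    # Parse the byte stream string according to the given format (fmt)
    # and return the parsed values as a tuple
    unpack(fmt, string)
    # Calculate how many bytes of memory the given format (fmt) occupies
    calcsize(fmt)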
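
As a minimal sketch of how the three functions fit together, the following packs two integers and reads them back. It assumes Python 3, where pack() returns a bytes object; the `<` prefix and the sample values are choices made for this illustration so that the layout is the same on every machine.

```python
import struct

# Pack two integers into a byte stream (little-endian, standard sizes).
packed = struct.pack("<2i", 1, 2)

# calcsize reports how many bytes the format occupies.
size = struct.calcsize("<2i")

# unpack always returns a tuple, even when the format holds one value.
values = struct.unpack("<2i", packed)

print(packed, size, values)
```

The length of the packed data always equals calcsize() for the same format, which is useful when deciding how many bytes to read before unpacking.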

The format string (fmt) used above supports the following format characters:

FORMAT	C TYPE	PYTHON TYPE	STANDARD SIZE	NOTES
x	pad byte	no value
c	char	string of length 1	1
b	signed char	integer	1	(3)
B	unsigned char	integer	1	(3)
?	_Bool	bool	1	(1)
h	short	integer	2	(3)
H	unsigned short	integer	2	(3)
i	int	integer	4	(3)
I	unsigned int	integer	4	(3)
l	long	integer	4	(3)
L	unsigned long	integer	4	(3)
q	long long	integer	8	(2), (3)
Q	unsigned long long	integer	8	(2), (3)
f	float	float	4	(4)
d	double	float	8	(4)
s	char[]	string
p	char[]	string
P	void *	integer		(5), (3)

The numbers in the NOTES column are carried over from the Python struct documentation and do not correspond to the numbered notes below.

Note 1. q and Q are only meaningful when the machine supports 64-bit operations.

Note 2. Each format character can be preceded by a number indicating the repeat count; for example, 2I is the same as II.

Note 3. The s format represents a string of a given length: 4s represents a string of length 4. The p format represents a Pascal string.

Note 4. P is used to convert a pointer; its length depends on the machine word size.

Note 5. The last format character (P) can be used to represent a pointer type, which occupies 4 bytes on a 32-bit machine.
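
The notes on repeat counts, s, and p can be checked with a short sketch. It assumes Python 3, where string data is passed as bytes objects; the sample values are hypothetical.

```python
import struct

# A repeat count is shorthand for repeating the format character.
same_size = struct.calcsize("<2I") == struct.calcsize("<II")

# For "s" the number is the length of one string, not a repeat count.
fixed = struct.pack("4s", b"ab")          # shorter input is padded with null bytes
truncated = struct.pack("4s", b"abcdef")  # longer input is cut to 4 bytes

# "p" stores a Pascal string: one length byte followed by the data.
pascal = struct.pack("5p", b"ab")

# The size of "P" depends on the machine word size.
pointer_size = struct.calcsize("P")

print(same_size, fixed, truncated, pascal, pointer_size)
```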

To exchange data with structs in C, you also need to consider that some C or C++ compilers use byte alignment, usually in units of 4 bytes on 32-bit systems. For this reason, struct converts according to the byte order and alignment of the local machine by default. You can change the byte order, size, and alignment with the first character of the format string, defined as follows:

CHARACTER	BYTE ORDER	SIZE	ALIGNMENT
@	Native	Native	Native
=	Native	Standard	None
<	Little-endian	Standard	None
>	Big-endian	Standard	None
!	Network (= Big-endian)	Standard	None

To use one of these characters, place it in the first position of fmt, for example '@5s6sif'.
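
The effect of the prefix character can be seen in a small sketch. The native size shown depends on the platform and compiler, so the code does not assume a particular value for it.

```python
import struct

# The same value packed with different byte orders.
little = struct.pack("<I", 1)
big = struct.pack(">I", 1)
network = struct.pack("!I", 1)  # network order is big-endian

# "=" uses standard sizes without alignment: 1-byte char + 4-byte int.
standard_size = struct.calcsize("=ci")

# "@" (the default) uses native alignment, so padding may be inserted
# between the char and the int; the result is platform dependent.
native_size = struct.calcsize("@ci")

print(little, big, network, standard_size, native_size)
```

When the data is exchanged between machines, choosing an explicit prefix such as `<`, `>`, or `!` keeps the layout independent of the platform that runs the code.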

Example 1:

The structure is as follows:

    struct Header
    {
        unsigned short id;
        char tag[4];
        unsigned int version;
        unsigned int count;
    };

Suppose data with the above structure has been received through socket.recv and stored in the string s. To parse it, you can use the unpack() function:

    import struct
    id, tag, version, count = struct.unpack("!H4s2I", s)

In the format string above, ! indicates that the data should be parsed in network byte order, because it was received from the network and data is transmitted over the network in network byte order. The H that follows represents the unsigned short id, 4s represents a 4-byte string, and 2I represents two values of type unsigned int.

After the call to unpack, id, tag, version, and count hold our information.

Also, it is convenient to pack local data into a struct format:

    ss = struct.pack("!H4s2I", id, tag, version, count)

The pack function converts id, tag, version, and count into the layout of the Header struct according to the specified format. ss is now a string (actually a byte stream laid out like the C struct) that can be sent with socket.send(ss).
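
A self-contained round trip of the Header layout might look like the sketch below. No socket is involved; the field values and the names `HEADER_FORMAT`, `packet`, and `msg_id` are hypothetical, and Python 3 is assumed, so the tag is a bytes object.

```python
import struct

# unsigned short, 4-byte string, two unsigned ints, network byte order
HEADER_FORMAT = "!H4s2I"

# Hypothetical sample values standing in for a real header.
packet = struct.pack(HEADER_FORMAT, 7, b"DATA", 1, 42)

# With "!" there is no padding: 2 + 4 + 4 + 4 bytes.
header_size = struct.calcsize(HEADER_FORMAT)

# Unpack exactly header_size bytes, as one would after receiving data.
msg_id, tag, version, count = struct.unpack(HEADER_FORMAT, packet[:header_size])

print(header_size, msg_id, tag, version, count)
```

Slicing to `header_size` matters in practice because unpack() raises struct.error when the buffer length does not match the format exactly.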

Example 2:

    import struct

    a = 12
    # Convert a to binary ('i' packs an integer; a float such as 12.34 would need 'f')
    bytes = struct.pack('i', a)

At this point, bytes is a string whose contents match, byte for byte, the binary storage of a.

Conversely, given existing binary data bytes (which is actually a string), you can convert it back into a Python data type:

    # Note: unpack returns a tuple!
    a, = struct.unpack('i', bytes)

If the data consists of multiple values, you can do this:

    a = b'hello'   # 's' expects byte strings on Python 3
    b = b'world!'
    c = 2
    d = 45.123
    bytes = struct.pack('5s6sif', a, b, c, d)

At this point, bytes holds the data in binary form and can be written directly to a file, for example with binfile.write(bytes).

Later, when needed, we can read it back with bytes = binfile.read().

Then decode it into Python variables with struct.unpack():

    a, b, c, d = struct.unpack('5s6sif', bytes)

'5s6sif' is called fmt, the format string. It consists of numbers and characters: 5s represents a 5-character string, 2i represents 2 integers, and so on. The available characters and types are listed in the format table earlier in this article, and each C type corresponds one-to-one to a Python type.
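
The write, read, and unpack steps can be put together in one sketch. It assumes Python 3 and uses a temporary file in place of a real data file; the name `FMT` is introduced only for this illustration.

```python
import os
import struct
import tempfile

FMT = "5s6sif"
record = struct.pack(FMT, b"hello", b"world!", 2, 45.123)

# Write the packed bytes to a temporary file, which is opened in binary mode.
with tempfile.NamedTemporaryFile(delete=False) as binfile:
    binfile.write(record)
    path = binfile.name

# Read the bytes back in binary mode and decode them.
with open(path, "rb") as binfile:
    a, b, c, d = struct.unpack(FMT, binfile.read())
os.remove(path)

# "f" is a 4-byte float, so d is only approximately equal to 45.123.
print(a, b, c, round(d, 3))
```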

Note: Problems encountered while processing binary files

When working with binary files, open them as follows:

    binfile = open(filepath, 'rb')  # read a binary file
    binfile = open(filepath, 'wb')  # write a binary file

So how does the result differ from binfile = open(filepath, 'r')?

There are two differences:

First, on Windows, if 0x1A is encountered while reading with 'r', it is treated as the end of the file (EOF). There is no such problem with 'rb'. That is, if you write in binary mode and then read in text mode, only part of the file is read when it contains 0x1A. With 'rb', reading always continues to the end of the file.

Second, for the string x = 'abc\ndef', len(x) gives a length of 7. The \n, which we call the newline character, is actually 0x0A. When we write it in text mode with 'w', on Windows the 0x0A is automatically converted to the two characters 0x0D and 0x0A, so the length of the file actually becomes 8. When the file is read in text mode with 'r', the pair is automatically converted back to the original newline character. If you write in binary mode with 'wb' instead, the character is kept intact as a single byte and is read back as is. So if you write in text mode and read in binary mode, you must account for the extra byte. 0x0D is also called carriage return. Linux makes no such change, because Linux uses only 0x0A to represent a line break.
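
The newline difference can be reproduced on any platform with the sketch below. It assumes Python 3, whose open() accepts a newline argument; passing newline="\r\n" forces the Windows-style translation so that the extra byte is visible on Linux as well.

```python
import os
import tempfile

x = "abc\ndef"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Text mode write: each "\n" is translated to "\r\n" (0x0D 0x0A).
    with open(path, "w", newline="\r\n") as f:
        f.write(x)

    # Binary mode read: the bytes come back exactly as stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode read: "\r\n" is translated back to "\n".
    with open(path, "r") as f:
        text = f.read()

print(len(x), len(raw), len(text))
```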
