Reprinted from: http://blog.csdn.net/evan2008/article/details/8002958
Sometimes you need to process binary data in Python, for example when reading and writing files or working with sockets. For this you can use Python's struct module, which converts between Python values and C structs represented as byte strings.
The three most important functions in the struct module are pack(), unpack(), and calcsize():
pack(fmt, v1, v2, ...) packs the values according to the given format (fmt) into a string (actually a byte stream laid out like a C struct).
unpack(fmt, string) parses the byte stream string according to the given format (fmt) and returns the parsed values as a tuple.
calcsize(fmt) calculates how many bytes the given format (fmt) occupies.
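The three functions can be tried together in a few lines. This is a minimal sketch, assuming Python 3 (where the packed result is a bytes object); the values 1 and 2 are made up, and the "!" prefix is used only so the output is the same on every machine.

```python
import struct

# Pack an unsigned short (H) and an unsigned int (I) into a byte stream.
packed = struct.pack("!HI", 1, 2)

# calcsize reports how many bytes the format occupies: 2 + 4.
size = struct.calcsize("!HI")

# unpack parses the byte stream and always returns a tuple.
values = struct.unpack("!HI", packed)

print(packed, size, values)
```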
The formats supported by struct are listed in the following table:
Format | C type | Python type | Size in bytes
x | pad byte | no value | 1
c | char | string of length 1 | 1
b | signed char | integer | 1
B | unsigned char | integer | 1
? | _Bool | bool | 1
h | short | integer | 2
H | unsigned short | integer | 2
i | int | integer | 4
I | unsigned int | integer or long | 4
l | long | integer | 4
L | unsigned long | long | 4
q | long long | long | 8
Q | unsigned long long | long | 8
f | float | float | 4
d | double | float | 8
s | char[] | string | 1
p | char[] | string | 1
P | void * | long | (machine-dependent)
Note 1. q and Q are available in native mode only if the platform supports 64-bit integers (the C long long type).
Note 2. Each format character can be preceded by a number giving its repeat count.
Note 3. The s format represents a string of a given length: 4s is a string of length 4. p represents a Pascal string.
Note 4. P is used to convert a pointer; its length depends on the machine word size.
Note 5. The last format (P) can represent a pointer type, which takes 4 bytes on a 32-bit machine (8 bytes on a 64-bit machine).
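The notes above can be checked with a short sketch, assuming Python 3 (so s and p fields take bytes objects). The variable names are illustrative only.

```python
import struct

# A count before a format character repeats it: "3h" is the same as "hhh".
repeat_size = struct.calcsize("<3h")
expanded_size = struct.calcsize("<hhh")

# For "s" the count is the length of one string, not a repeat count.
# A shorter value is padded with NUL bytes up to that length.
fixed = struct.pack("<4s", b"ab")

# "p" is a Pascal string: the first byte stores the length, then the data.
pascal = struct.pack("<4p", b"ab")

# "P" (pointer) only works in native mode; its size depends on the machine.
pointer_size = struct.calcsize("P")

print(repeat_size, expanded_size, fixed, pascal, pointer_size)
```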
To exchange data with C structs, you also need to consider that C and C++ compilers align struct members on byte boundaries, usually 4 bytes on 32-bit systems. By default, struct therefore converts using the local machine's byte order and alignment. You can change this with the first character of the format string, defined as follows:
Character | Byte order | Size and alignment
@ | native | native, padded to 4 bytes
= | native | standard, by original number of bytes
< | little-endian | standard, by original number of bytes
> | big-endian | standard, by original number of bytes
! | network (= big-endian) | standard, by original number of bytes
To use one, put it in the first position of fmt, for example '@5s6sif'.
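The effect of the prefix character is easiest to see with calcsize() and a one-field pack(). This is a minimal sketch; the format "ci" is an arbitrary choice, and the native size depends on the machine, so only the standard size is stated in the comments.

```python
import struct

fmt = "ci"  # one char followed by one int

# Native mode may insert padding after the char so the int is aligned.
native_size = struct.calcsize("@" + fmt)

# Standard mode uses fixed sizes and no padding: 1 + 4 bytes.
standard_size = struct.calcsize("=" + fmt)

# The same value packed with explicit byte orders.
little = struct.pack("<I", 1)
big = struct.pack(">I", 1)
network = struct.pack("!I", 1)

print(native_size, standard_size, little, big, network)
```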
Example one:
Suppose there is a structure:
struct Header
{
unsigned short id;
char tag[4];
unsigned int version;
unsigned int count;
};
Suppose data with the structure above has been received through socket.recv and stored in the string s. To parse it, use the unpack() function.
import struct
id, tag, version, count = struct.unpack("!H4s2I", s)
In the format string above, ! means that we parse using network byte order, because the data was received from the network, and data is transmitted over the network in network byte order. The H that follows represents the unsigned short id, 4s represents a string 4 bytes long, and 2I represents two values of type unsigned int.
After the unpack call, id, tag, version, and count hold our information.
Likewise, it is easy to pack local data into the struct format:
ss = struct.pack("!H4s2I", id, tag, version, count)
The pack function converts id, tag, version, and count into the Header struct layout according to the given format. ss is now a string (actually a byte stream laid out like the C struct) that can be sent with socket.send(ss).
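Example one can be run without a socket by packing a header first and then unpacking it. This is a sketch assuming Python 3; the field values are made up, and s merely stands in for the bytes that socket.recv would return.

```python
import struct

HEADER_FMT = "!H4s2I"  # id, tag, version, count in network byte order

# Stand-in for data received from the network (hypothetical values).
s = struct.pack(HEADER_FMT, 7, b"DATA", 1, 3)

# Parse the byte stream back into the four fields.
header_id, tag, version, count = struct.unpack(HEADER_FMT, s)

print(len(s), header_id, tag, version, count)
```

Note that "!" adds no alignment padding, so the packed header is 14 bytes long. It only matches the C struct if the sending side also lays the struct out without padding.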
Example two:
import struct
a = 12.34
# convert a to binary
bytes = struct.pack('i', a)
At this point, bytes is a string whose contents are, byte for byte, the binary storage of a.
Now the reverse operation.
Given existing binary data bytes (which is actually a string), convert it back into a Python data type:
a, = struct.unpack('i', bytes)
Note that unpack returns a tuple.
So if there is only one variable:
bytes = struct.pack('i', a)
then decoding it needs to look like this:
a, = struct.unpack('i', bytes) or (a,) = struct.unpack('i', bytes)
If you write a = struct.unpack('i', bytes) directly, then a = (12.34,) is a tuple rather than the original floating-point number.
My note: I am not sure whether the original author made a mistake here, so I explain it as follows.
To convert a to binary, you should use struct.pack('f', a) or struct.pack('d', a), and unpack with the same format. With 'f' the result has a rounding error; with 'd' it does not.
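The difference between 'f' and 'd' can be measured directly. This is a minimal sketch assuming Python 3, where passing a float to the 'i' format raises struct.error; the variable names are illustrative.

```python
import struct

a = 12.34

# Round-trip through a 4-byte float and an 8-byte double.
(as_float,) = struct.unpack("f", struct.pack("f", a))
(as_double,) = struct.unpack("d", struct.pack("d", a))

float_error = abs(as_float - a)    # small but non-zero
double_error = abs(as_double - a)  # zero: Python floats are doubles

# The integer format does not accept a float.
try:
    struct.pack("i", a)
    int_accepted = True
except struct.error:
    int_accepted = False

print(float_error, double_error, int_accepted)
```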
If the data consists of several values, you can do this:
a = 'hello'
b = 'world!'
c = 2
d = 45.123
bytes = struct.pack('5s6sif', a, b, c, d)
At this point bytes holds the data in binary form and can be written directly to a file, for example binfile.write(bytes).
Later, when needed, we can read it back with bytes = binfile.read()
and then decode it into Python variables with struct.unpack():
a, b, c, d = struct.unpack('5s6sif', bytes)
'5s6sif' is called fmt, the format string. It consists of numbers and format characters: 5s represents a 5-character string, 2i represents 2 integers, and so on. The available characters and types are those listed in the format table above, where each C type corresponds one-to-one with a Python type.
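The write-then-read cycle described above can be sketched end to end. This assumes Python 3 (so the strings are bytes objects) and uses a temporary file whose name, data.bin, is hypothetical.

```python
import os
import struct
import tempfile

fmt = "5s6sif"
a, b, c, d = b"hello", b"world!", 2, 45.123

# Write the packed record to a binary file.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as binfile:
    binfile.write(struct.pack(fmt, a, b, c, d))

# Read it back and decode it with the same format string.
with open(path, "rb") as binfile:
    data = binfile.read()

a2, b2, c2, d2 = struct.unpack(fmt, data)
print(a2, b2, c2, d2)
```

Because d is stored with the 4-byte 'f' format, d2 comes back close to 45.123 but not exactly equal to it.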
Note: problems encountered when processing binary files
When working with binary files, open them as follows:
binfile = open(filepath, 'rb') to read a binary file
binfile = open(filepath, 'wb') to write a binary file
So how does this differ from binfile = open(filepath, 'r')?
There are two differences:
First, when reading with 'r' on Windows, a 0x1A byte is treated as the end of the file (EOF). With 'rb' there is no such problem. That is, if you write a file in binary mode and then read it in text mode, and it contains 0x1A, only part of the file will be read. With 'rb', reading always continues to the end of the file.
Second, for the string x = 'abc\ndef', len(x) gives a length of 7. \n, the newline character, is actually 0x0A. When we write it with 'w', that is, in text mode, on Windows the 0x0A is automatically replaced by the two characters 0x0D 0x0A, so the file length actually becomes 8. When read with 'r' in text mode, it is automatically converted back to the original newline character. If you write with 'wb' in binary mode instead, the single character is kept unchanged and is read back as is. So if you write in text mode and read in binary mode, you have to account for the extra byte. 0x0D is also called carriage return. On Linux nothing changes, because Linux uses only 0x0A to represent a line break.
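The extra byte can be reproduced on any platform. This sketch assumes Python 3 and passes newline="\r\n" to open() to force the Windows-style translation explicitly; the file name newline.txt is hypothetical.

```python
import os
import tempfile

x = "abc\ndef"
path = os.path.join(tempfile.mkdtemp(), "newline.txt")

# Text-mode write; newline="\r\n" emulates what Windows does by default.
with open(path, "w", newline="\r\n") as f:
    f.write(x)

# Binary-mode read shows the bytes exactly as stored on disk.
with open(path, "rb") as f:
    raw = f.read()

# Text-mode read translates the line ending back to "\n".
with open(path, "r") as f:
    text = f.read()

print(len(x), len(raw), raw, repr(text))
```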
Python uses struct to handle binary