Use vstruct to parse binary data

Source: Internet
Author: User

vstruct is a pure-Python module for parsing and serializing binary data. It is a submodule of the vivisect project, a binary analysis framework initiated by Invisig0th Kenshoto. vstruct has been developed and tested for many years and is integrated into many production systems. Beyond that, the module is easy to learn and fun to use!

Are you still writing scripts by hand with the struct module? Give vstruct a try! Code written with vstruct tends to be more declarative, more concise, and easier to understand, because hand-written binary parsing code usually carries a large amount of boilerplate, while vstruct code does not. Declarative code emphasizes the aspects of binary analysis that matter: offsets, sizes, and types. This makes vstruct-based parsers easier to maintain over the long term.
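To make the comparison concrete, here is a minimal sketch of the hand-written style that the paragraph above describes. It uses only the standard struct module. The record layout (a 4-byte magic, a 16-bit version, and a 32-bit length) and the names RECORD and parse_record are hypothetical, invented purely for illustration.

```python
import struct

# Hypothetical record: 4-byte magic, uint16 version, uint32 length (little-endian).
RECORD = b"DEMO" + b"\x02\x00" + b"\x10\x00\x00\x00"


def parse_record(buf):
    """Parse the hypothetical record by hand, tracking the offset manually."""
    offset = 0
    magic = buf[offset:offset + 4]
    offset += 4
    (version,) = struct.unpack_from("<H", buf, offset)
    offset += struct.calcsize("<H")
    (length,) = struct.unpack_from("<I", buf, offset)
    offset += struct.calcsize("<I")
    return {"magic": magic, "version": version, "length": length, "size": offset}


fields = parse_record(RECORD)
print(fields)
```

Each field costs several lines of offset bookkeeping, and the layout (offset, size, type) is buried in the control flow rather than stated in one place.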

0x00 Installing vstruct

The vstruct module is part of the vivisect project. Currently, the project is compatible with Python 2.7, and a vivisect branch for Python 3.x is under development.

Because the vivisect subprojects are not distributed with setuptools-compatible setup.py files, you need to download the source tree and put the vstruct directory somewhere on your Python path, such as the current directory:

    $ git clone https://github.com/vivisect/vivisect.git vivisect
    $ cd vivisect
    $ python
    In [1]: import vstruct
    In [2]: vstruct.isVstructType(str)
    Out[2]: False

Without a setuptools-compatible setup.py, however, it is cumbersome to declare vstruct as a dependency of your own Python module. For convenience, I provide a PyPI mirror package named vivisect-vstruct-wb, so you can install vstruct directly with pip:

    $ mkdir /tmp/env
    $ virtualenv -p python2 /tmp/env
    $ /tmp/env/bin/pip install vivisect-vstruct-wb
    $ /tmp/env/bin/python
    In [1]: import vstruct
    In [2]: vstruct.isVstructType(str)
    Out[2]: False

I have updated this mirror so that it supports both Python 2.7 and Python 3.x interpreters, so readers can continue to use vivisect-vstruct-wb in future projects. In addition, remember to check Visi's GitHub, which may already have what you are looking for.

0x01 Getting started with vstruct

The following example is the vstruct equivalent of the "Hello World!" program you write when learning a programming language. It uses vstruct to parse a little-endian 32-bit unsigned integer from a byte string:

    In [1]: import vstruct
    In [2]: u32 = vstruct.primitives.v_uint32()
    In [3]: u32.vsParse(b"\x01\x02\x03\x04")
    In [4]: hex(u32)
    Out[4]: '0x4030201'

Note how the code above creates a v_uint32 instance, parses a byte string with the .vsParse() method, and then treats the result like a native Python value. To be safe, I prefer to explicitly convert the parsed object to a pure Python type:

    In [5]: type(u32)
    Out[5]: vstruct.primitives.v_uint32

    In [6]: python_u32 = int(u32)

    In [7]: type(python_u32)
    Out[7]: int

    In [8]: hex(python_u32)
    Out[8]: '0x4030201'
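As a cross-check of the little-endian parse above, the same four bytes can be decoded with the standard library alone. This sketch assumes Python 3 (for int.from_bytes) and does not use vstruct at all. It only confirms what value a little-endian 32-bit read of these bytes should produce.

```python
import struct

raw = b"\x01\x02\x03\x04"

# "<I" means: little-endian, unsigned 32-bit integer.
(value,) = struct.unpack("<I", raw)

# int.from_bytes performs the same conversion without a format string.
same_value = int.from_bytes(raw, byteorder="little")

print(hex(value))
```

The least significant byte comes first in the input, which is why 01 02 03 04 reads back as 0x04030201.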

In fact, every vstruct operation is defined as a method with a vs prefix, and these methods are available on almost all vstruct-derived parsers. Although I mostly use the .vsParse() and .vsSetLength() methods, it is worth being familiar with all of them. Here is a brief summary of each method:

.vsParse() -- parses an instance from a byte string.
.vsParseFd() -- parses an instance from a file-like object (the object must provide a .read() method).
.vsEmit() -- serializes an instance into a byte string.
.vsSetValue() -- assigns a value to the instance from a native Python value.
.vsGetValue() -- copies the instance's data out as a native Python value.
.vsSetLength() -- sets the length of an array-like type such as v_str.
.vsIsPrim() -- returns True if the instance is a simple primitive type.
.vsGetTypeName() -- gets a string containing the name of the instance's type.
.vsGetEnum() -- gets the v_enum instance associated with a v_number instance, if there is one.
.vsSetMeta() -- (internal method).
.vsCalculate() -- (internal method).
.vsGetMeta() -- (internal method).

So far, vstruct looks like little more than a clone of struct.unpack, so next we introduce its cooler features.

0x02 Advanced vstruct features

vstruct parsers are generally class-based. The module provides a set of basic data types (for example, v_uint32 and v_wstr, used for a DWORD and a wide string respectively) and a mechanism for combining these types into more advanced data types (VStructs). First, the basic data types:

vstruct.primitives.v_int8 -- a signed integer.
vstruct.primitives.v_uint8 -- an unsigned integer (wider variants such as v_uint32 follow the same naming).
vstruct.primitives.v_bytes -- a fixed-length sequence of raw bytes.
vstruct.primitives.v_str -- an ASCII string with a specific length.
vstruct.primitives.v_wstr -- a wide string with a specific length.
vstruct.primitives.v_zstr -- a NULL-terminated ASCII string.
vstruct.primitives.v_zwstr -- a NULL-terminated wide string.
vstruct.primitives.GUID
vstruct.primitives.v_enum -- an integer type with named values.
vstruct.primitives.v_bitmask -- an integer type with named bit flags.
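To show what the two NULL-terminated string types in this list mean at the byte level, here is a small standard-library sketch. It is not vstruct code: the helper names read_zstr and read_zwstr are hypothetical, and the sketch assumes that a "wide string" is encoded as UTF-16LE.

```python
def read_zstr(buf, offset=0):
    """Decode an ASCII string that ends at the first NUL byte."""
    end = buf.index(b"\x00", offset)
    return buf[offset:end].decode("ascii")


def read_zwstr(buf, offset=0):
    """Decode a UTF-16LE string that ends at the first 16-bit NUL."""
    # Step two bytes at a time so that the zero high byte of an ASCII-range
    # character is not mistaken for part of the terminator.
    for end in range(offset, len(buf) - 1, 2):
        if buf[end:end + 2] == b"\x00\x00":
            return buf[offset:end].decode("utf-16-le")
    raise ValueError("no 16-bit NUL terminator found")


narrow = b"hi\x00junk"
wide = "hi".encode("utf-16-le") + b"\x00\x00" + b"zz"
print(read_zstr(narrow), read_zwstr(wide))
```

The fixed-length variants (v_str, v_wstr) differ only in that the length is known up front instead of being discovered by scanning for the terminator.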

Complex parsers are developed by defining subclasses of the vstruct.VStruct class, which contain member variables that are instances of vstruct primitives or of other, more complex VStruct types. Okay, I admit that sentence is a mouthful, so let's break it down!

Complex parsers are developed by defining subclasses of the `vstruct.VStruct` class…

    class IMAGE_NT_HEADERS(vstruct.VStruct):
        def __init__(self):
            vstruct.VStruct.__init__(self)

In this example, we use vstruct to define the PE header of a Windows executable file. Our parser is named IMAGE_NT_HEADERS and is derived from the class vstruct.VStruct. We must explicitly call the parent class constructor in the __init__() method, either as vstruct.VStruct.__init__(self) or as super(IMAGE_NT_HEADERS, self).__init__().

…that contain member variables that are instances of `vstruct` primitives…

    class IMAGE_NT_HEADERS(vstruct.VStruct):
        def __init__(self):
            vstruct.VStruct.__init__(self)
            self.Signature      = vstruct.primitives.v_bytes(size=4)

The first member variable of an IMAGE_NT_HEADERS instance is a v_bytes instance that holds 4 bytes. v_bytes is usually used to store a raw byte sequence that needs no further parsing. In this example, the .Signature member holds the magic sequence "PE\x00\x00" when a valid PE file is parsed.

When defining the class, you can add further member variables to parse the successive parts of the binary data. The VStruct class records the order in which members are declared and takes care of the other bookkeeping. All you need to do is decide which types to use and in what order. Easy enough!

When a structure appears in several places as a substructure, you can factor it out into a reusable VStruct type and then use it just like a basic vstruct type.

[Complex parsers are developed by defining classes that contain] other complex `VStruct` types.

    class IMAGE_NT_HEADERS(vstruct.VStruct):
        def __init__(self):
            vstruct.VStruct.__init__(self)
            self.Signature      = v_bytes(size=4)
            self.FileHeader     = IMAGE_FILE_HEADER()

When a VStruct instance parses binary data and encounters a complex member, it recurses into that member's parser. In this example, the .FileHeader member is a composite type that is defined elsewhere. The IMAGE_NT_HEADERS parser first consumes the four bytes of the .Signature field and then hands control to the IMAGE_FILE_HEADER parser. To determine the overall size and layout, you have to look at the definition of that class.

My advice is to develop several VStruct classes, each responsible for a small part of the file format, and then combine them in a higher-level VStruct. This makes debugging easier, because each part of the parser can be checked separately. Either way, once a VStruct is defined, you can parse data with the pattern described at the beginning of this article.

    In [9]:
    with open("kernel32.dll", "rb") as f:
        bytez = f.read()

    In [10]: hexdump.hexdump(bytez[0xf8:0x110])
    Out[10]:
    00000000: 50 45 00 00 4C 01 06 00  62 67 7D 53 00 00 00 00  PE..L...bg}S....
    00000010: 00 00 00 00 E0 00 0E 21                           .......!

    In [11]: pe_header = IMAGE_NT_HEADERS()
    In [12]: pe_header.vsParse(bytez[0xf8:0x110])
    In [13]: pe_header.Signature
    Out[13]: b'PE\x00\x00'
    In [14]: pe_header.FileHeader.Machine
    Out[14]: 332

In [9], we open a sample PE file and read its contents into a byte string. In [10], we display a hex dump of the bytes at the start of the PE header. In [11], we create an instance of the IMAGE_NT_HEADERS class; note that it does not contain any parsed data yet. In [12], we explicitly parse the byte string that holds the PE header. In [13] and [14], we inspect members of the parsed instance. Note that when we access an embedded composite VStruct, we can index further into its contents, whereas when we access a primitive member, we get the data as a native Python value. This is really convenient!
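The values in this session can be cross-checked without vstruct. The sketch below feeds the 24 bytes from the hex dump to the standard struct module, assuming the standard PE layout of a 4-byte signature followed by the seven IMAGE_FILE_HEADER fields (two 16-bit fields, three 32-bit fields, two 16-bit fields). It assumes Python 3 for bytes.fromhex.

```python
import struct

# The 24 bytes shown in the hex dump above (file offsets 0xf8 to 0x110).
header = bytes.fromhex(
    "50450000" "4C01" "0600" "62677D53" "00000000" "00000000" "E000" "0E21"
)

# Signature (4 raw bytes) followed by the seven IMAGE_FILE_HEADER fields.
(signature, machine, number_of_sections, time_date_stamp,
 pointer_to_symbol_table, number_of_symbols,
 size_of_optional_header, characteristics) = struct.unpack("<4sHHIIIHH", header)

print(signature, machine, number_of_sections)
```

The single format string "<4sHHIIIHH" carries the same layout information as the class definition, but without field names, nesting, or reuse.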

During debugging, we can print the parsed data in human-readable form using the .tree() method:

    In [15]: print(pe_header.tree())
    Out[15]:
    00000000 (24) IMAGE_NT_HEADERS: IMAGE_NT_HEADERS
    00000000 (04)   Signature: 50450000
    00000004 (20)   FileHeader: IMAGE_FILE_HEADER
    00000004 (02)     Machine: 0x0000014c (332)
    00000006 (02)     NumberOfSections: 0x00000006 (6)
    00000008 (04)     TimeDateStamp: 0x537d6762 (1400727394)
    0000000c (04)     PointerToSymbolTable: 0x00000000 (0)
    00000010 (04)     NumberOfSymbols: 0x00000000 (0)
    00000014 (02)     SizeOfOptionalHeader: 0x000000e0 (224)
    00000016 (02)     Characteristics: 0x0000210e (8462)

0x03 Advanced vstruct topics

Conditional members

Because the layout of a VStruct is defined in the type's __init__() constructor, the constructor can inspect its parameters and selectively include members. For example, a VStruct can behave differently on 32-bit and 64-bit platforms, as shown below:

    class FooHeader(vstruct.VStruct):
        def __init__(self, bitness=32):
            super(FooHeader, self).__init__()
            if bitness == 32:
                self.data_pointer = v_ptr32()
            elif bitness == 64:
                self.data_pointer = v_ptr64()
            else:
                raise RuntimeError("invalid bitness: {:d}".format(bitness))
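The same idea, choosing the layout from a constructor-style parameter before any data is seen, can be sketched with the standard library. The helpers pointer_format and parse_pointer are hypothetical names for this illustration and are not part of vstruct.

```python
import struct


def pointer_format(bitness=32):
    """Return the struct format for a little-endian pointer of the given width."""
    if bitness == 32:
        return "<I"
    elif bitness == 64:
        return "<Q"
    raise RuntimeError("invalid bitness: {:d}".format(bitness))


def parse_pointer(buf, bitness=32):
    """Return (pointer value, bytes consumed) for the chosen pointer width."""
    fmt = pointer_format(bitness)
    (value,) = struct.unpack_from(fmt, buf)
    return value, struct.calcsize(fmt)


buf = b"\x78\x56\x34\x12" + b"\x00" * 4
print(parse_pointer(buf, 32), parse_pointer(buf, 64))
```

In both versions the decision depends only on a parameter supplied by the caller, never on the data being parsed.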

This is a very powerful technique, although using it correctly takes some care. It is important to understand when a layout is finalized, when it is evaluated, and when it is used to parse binary data. When __init__() is called, the instance has no access to the data to be parsed. Only when .vsParse() is called are the member variables filled with parsed data. Therefore, a VStruct constructor cannot refer to the contents of its members to decide how to parse what follows. For example, the following code does not work:

    class BazDataRegion(vstruct.VStruct):
        def __init__(self):
            super(BazDataRegion, self).__init__()
            self.data_size = v_uint32()
            # NO! self.data_size doesn't contain anything yet!!!
            self.data_data = v_bytes(size=self.data_size)

Callback functions

To handle dynamic parsers correctly, we use vstruct callback functions. When a VStruct instance finishes parsing a member field, it checks whether the class has a method whose name is the field name prefixed with pcb_ (parser callback); if so, that method is called. For example, once BazDataRegion.data_size has been parsed, the method named BazDataRegion.pcb_data_size is called, provided that it exists.

This matters because, when the callback is called, the VStruct instance has already been partially filled with parsed data. For example:

    In [16]:
    class BlipBlop(vstruct.VStruct):
        def __init__(self):
            super(BlipBlop, self).__init__()
            self.aaa = v_uint32()
            self.bbb = v_uint32()
            self.ccc = v_uint32()

        def pcb_aaa(self):
            print("pcb_aaa: aaa: %s\n" % hex(self.aaa))

        def pcb_bbb(self):
            print("pcb_bbb: aaa: %s"   % hex(self.aaa))
            print("pcb_bbb: bbb: %s\n" % hex(self.bbb))

        def pcb_ccc(self):
            print("pcb_ccc: aaa: %s"   % hex(self.aaa))
            print("pcb_ccc: bbb: %s"   % hex(self.bbb))
            print("pcb_ccc: ccc: %s\n" % hex(self.ccc))

    In [17]: bb = BlipBlop()
    In [18]: bb.vsParse(b"AAAABBBBCCCC")
    Out[18]:
    pcb_aaa: aaa: 0x41414141

    pcb_bbb: aaa: 0x41414141
    pcb_bbb: bbb: 0x42424242

    pcb_ccc: aaa: 0x41414141
    pcb_ccc: bbb: 0x42424242
    pcb_ccc: ccc: 0x43434343
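The mechanism can be pictured with a toy parser written from scratch. This is a sketch of the general technique (look up an optional method by name after each field), not vstruct's actual implementation. The classes TinyStruct and BlipBlopToy and the fields tuple are hypothetical.

```python
import struct


class TinyStruct(object):
    """Toy parser: fields are (name, struct format) pairs parsed in order."""

    fields = ()

    def parse(self, buf):
        offset = 0
        for name, fmt in self.fields:
            (value,) = struct.unpack_from(fmt, buf, offset)
            setattr(self, name, value)
            offset += struct.calcsize(fmt)
            # Look for an optional callback named after the field just parsed.
            callback = getattr(self, "pcb_" + name, None)
            if callback is not None:
                callback()
        return offset


class BlipBlopToy(TinyStruct):
    fields = (("aaa", "<I"), ("bbb", "<I"), ("ccc", "<I"))

    def __init__(self):
        self.seen = []

    def pcb_bbb(self):
        # aaa and bbb are populated here; ccc has not been parsed yet.
        self.seen.append((hex(self.aaa), hex(self.bbb), hasattr(self, "ccc")))


toy = BlipBlopToy()
consumed = toy.parse(b"AAAABBBBCCCC")
print(toy.seen)
```

The callback runs in the middle of the parse, so it sees every earlier field but none of the later ones.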

This means that we can postpone finalizing a class layout until some of the binary data has been parsed. Here is the correct way to implement a buffer whose size is specified in the data:

    In [19]:
    class BazDataRegion2(vstruct.VStruct):
        def __init__(self):
            super(BazDataRegion2, self).__init__()
            self.data_size = v_uint32()
            self.data_data = v_bytes(size=0)

        def pcb_data_size(self):
            self["data_data"].vsSetLength(self.data_size)

    In [20]: bdr = BazDataRegion2()
    In [21]: bdr.vsParse(b"\x02\x00\x00\x00\x99\x88\x77\x66\x55\x44\x33\x22\x11")
    In [22]: print(bdr.tree())
    Out[22]:
    00000000 (06) BazDataRegion2: BazDataRegion2
    00000000 (04)   data_size: 0x00000002 (2)
    00000004 (02)   data_data: 9988

In [19], we declare a structure with a header field (.data_size) that gives the size of the raw data that follows (.data_data). Because the header value has not been parsed yet when __init__() is called, we use a callback function named .pcb_data_size(), which is called once the .data_size field has been parsed. When the callback runs, it updates the size of the .data_data byte array to the correct number of bytes. In [20], we create an instance of the parser, and in [21] we parse a byte string with it. Although we pass in 13 bytes, we expect only 6 of them to be consumed: 4 bytes for the uint32 field .data_size and 2 bytes for the byte array .data_data. The remaining bytes are not processed. In [22], the output shows that our parser parsed the binary data correctly.
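The byte accounting in this walk-through can be verified with a two-step parse using only the standard struct module. The function parse_sized_region is a hypothetical helper written for this check; it mirrors the layout (a little-endian uint32 size followed by that many raw bytes) rather than vstruct's internals.

```python
import struct


def parse_sized_region(buf):
    """Parse a uint32 length prefix followed by that many raw bytes."""
    # Step 1: read the fixed-size header field.
    (data_size,) = struct.unpack_from("<I", buf, 0)
    start = struct.calcsize("<I")
    # Step 2: only now is the size of the variable part known.
    data = buf[start:start + data_size]
    if len(data) != data_size:
        raise ValueError("buffer too short for declared size")
    return data_size, data, start + data_size


data_size, data, consumed = parse_sized_region(
    b"\x02\x00\x00\x00\x99\x88\x77\x66\x55\x44\x33\x22\x11"
)
print(data_size, data, consumed)
```

Of the 13 input bytes, 6 are consumed, and the 7 trailing bytes are left untouched.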

Note that in the callback .pcb_data_size(), we used square brackets to access the member named data_data of the VStruct instance. This is because we want to modify the child instance itself rather than retrieve the parsed value it holds. Working out which form to use (self.field0.xyz or self["field0"].xyz) takes some practice, but in general, if you want a parsed value, you should avoid square brackets.

0x04 Summary

The vstruct module is a powerful aid for developing maintainable binary parsers. It removes a large amount of boilerplate from the development process. I especially like using vstruct to parse malware C2 protocols, database indexes, and binary XML files.
