[Android Security] Dex File Format Analysis

Source: Internet
Author: User

Copy from: 1190000007652937

0x00 Preface

The best way to learn the Dex file format is to find an introductory document, write a simple demo yourself, and then compare your own analysis against what 010Editor shows. For documentation, refer to the official document at http://source.android.com/devices/tech/dalvik/dex-format.html; if your English is poor you can also find a Chinese version, for example, I ...


010Editor is a very useful tool; I also used it earlier to analyze ELF files. In fact, once the right template is installed, it can analyze many kinds of files. It is paid software, but there is a 30-day free trial.

But what if you use a Mac? [...] ~/.config/SweetScape/010 Editor.ini.


0x01 File Layout

A Dex file can be divided into 3 parts: the header, the index area (xxxx_ids), and the data area. The header describes the layout of the entire Dex file, including the size and offset of each index area. The "ids" in the index area is short for identifiers, which identify each piece of data; the index area consists mainly of offsets into the data area.

010Editor displays every section except the data area; the link_data in its template is labeled map_list.


0x02 Header

The header describes the Dex file itself and holds the indexes of each of the other areas. 010Editor (writing 010Editor every time is tedious, so below I just write 010) uses the structure struct header_item to describe the header.

The 010 template uses two data types, char and uint. The char here is the 8-bit C/C++ char, which differs from the 16-bit char in Java, where short/ushort would be the closer match; this comes up again later when describing the tool I wrote. The official document uses ubyte instead, so the description below follows the official naming.

Structure Description:

ubyte 8-bit unsigned int
uint  32-bit unsigned int, little-endian

struct header_item 
{
    ubyte[8]  magic;
    uint      checksum; 
    ubyte[20] signature; 
    uint      file_size; 
    uint      header_size; 
    uint      endian_tag; 
    uint      link_size; 
    uint      link_off; 
    uint      map_off;
    uint      string_ids_size; 
    uint      string_ids_off; 
    uint      type_ids_size; 
    uint      type_ids_off; 
    uint      proto_ids_size; 
    uint      proto_ids_off; 
    uint      field_ids_size; 
    uint      field_ids_off; 
    uint      method_ids_size; 
    uint      method_ids_off; 
    uint      class_defs_size; 
    uint      class_defs_off; 
    uint      data_size;
    uint      data_off; 
}

Except for magic, checksum, signature, file_size, header_size, endian_tag, and map_off, the fields come in pairs: _off is the offset of a section, and _size is the number of elements in it (for link_size and data_size it is a size in bytes). The 7 unpaired fields mainly describe the Dex file itself.

    • magic: a fixed value used to identify a Dex file. As a string:

{0x64, 0x65, 0x78, 0x0A, 0x30, 0x33, 0x35, 0x00} = "dex\n035\0"

The byte in the middle is a newline, and the 035 after it is the version number.

    • checksum: the file checksum, computed with the Adler-32 algorithm over the rest of the file (everything except magic and checksum); it is used to detect file corruption.

    • signature: the SHA-1 hash of the rest of the file (everything except magic, checksum, and signature); it is used to uniquely identify the file.

    • file_size: the size of the Dex file.

    • header_size: the size of the header area, currently fixed at 0x70.

    • endian_tag: the endianness tag. The Dex format is little-endian, and the value is the constant 0x12345678.

    • map_off: the offset of map_list, which belongs to the data area, so its value is greater than or equal to data_off; in practice it sits at the end of the Dex file.
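
The header checks described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the header layout from the official dex-format document (including the field_ids pair); the names parse_header and verify_header are made up for this example and are not part of any library.

```python
import hashlib
import struct
import zlib

# Field names follow header_item in the official dex-format document.
HEADER_FIELDS = (
    "file_size", "header_size", "endian_tag", "link_size", "link_off",
    "map_off", "string_ids_size", "string_ids_off", "type_ids_size",
    "type_ids_off", "proto_ids_size", "proto_ids_off", "field_ids_size",
    "field_ids_off", "method_ids_size", "method_ids_off",
    "class_defs_size", "class_defs_off", "data_size", "data_off",
)
# magic (8 bytes), checksum (uint), signature (20 bytes), then 20 uints = 0x70 bytes
HEADER_STRUCT = struct.Struct("<8sI20s20I")


def parse_header(dex: bytes) -> dict:
    """Unpack header_item from the first 0x70 bytes of a Dex image."""
    magic, checksum, signature, *rest = HEADER_STRUCT.unpack_from(dex, 0)
    header = {"magic": magic, "checksum": checksum, "signature": signature}
    header.update(zip(HEADER_FIELDS, rest))
    return header


def verify_header(dex: bytes) -> bool:
    """Check the magic prefix, endian_tag, Adler-32 checksum and SHA-1 signature."""
    h = parse_header(dex)
    return (
        h["magic"][:4] == b"dex\n"
        and h["endian_tag"] == 0x12345678
        # checksum covers everything after magic (8 bytes) and checksum (4 bytes)
        and h["checksum"] == zlib.adler32(dex[12:]) & 0xFFFFFFFF
        # signature covers everything after magic, checksum and signature (32 bytes)
        and h["signature"] == hashlib.sha1(dex[32:]).digest()
    )
```

With a real file, something like verify_header(open("Hello.dex", "rb").read()) would be the typical call; the file name is just the demo name used in this article.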

0x03 String_ids

The string_ids section describes all the strings in the Dex file. The format is simple: each entry is a single offset, which points to a string in the string_data area.

The string data uses the LEB128 (little endian base 128) format, a variable-length encoding built from 1-byte units. If the highest bit of a byte is 1, another byte follows, and this continues until a byte whose highest bit is 0. The remaining 7 bits of each byte carry the data. In a Dex file a LEB128 value encodes at most 32 bits; see Leb128.h in the Dalvik source.
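
A small decoder makes the LEB128 rule concrete. This is a sketch of the unsigned variant only, with the least significant 7-bit group stored first; the function name read_uleb128 is chosen for this example.

```python
def read_uleb128(buf: bytes, pos: int = 0) -> tuple:
    """Decode one unsigned LEB128 value; return (value, next_position)."""
    result = 0
    for i in range(5):  # a 32-bit value needs at most 5 bytes
        byte = buf[pos + i]
        result |= (byte & 0x7F) << (7 * i)  # low 7 bits carry the data
        if byte & 0x80 == 0:  # highest bit 0 marks the last byte
            return result, pos + i + 1
    raise ValueError("uleb128 longer than 5 bytes")
```

For example, the two bytes 0x80 0x7F decode to 16256, because the first byte contributes no data bits and the second contributes 0x7F shifted left by 7.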


The data structure is:

ubyte    8-bit unsigned int
uint     32-bit unsigned int, little-endian
uleb128  unsigned LEB128, variable length

struct string_id_item
{
    uint string_data_off;
}

struct string_data_item 
{
    uleb128 utf16_size;
    ubyte[] data; 
}

Here data holds the bytes of the string (MUTF-8 encoded and terminated by a 0 byte), and utf16_size is the length of the string in UTF-16 code units. string_ids is critical: many of the later sections point directly to a string_ids index. When writing a tool, you also need to extract the string_ids first so that you can compare against them.
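
Resolving a string index could look like the sketch below. It assumes the two structures above, and it decodes the bytes as plain UTF-8, which is a simplification: MUTF-8 differs for U+0000 and for characters outside the Basic Multilingual Plane. The helper names are illustrative.

```python
import struct


def read_uleb128(buf: bytes, pos: int) -> tuple:
    """Decode one unsigned LEB128 value; return (value, next_position)."""
    result = 0
    for i in range(5):
        byte = buf[pos + i]
        result |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            return result, pos + i + 1
    raise ValueError("uleb128 longer than 5 bytes")


def read_string(dex: bytes, string_ids_off: int, idx: int) -> str:
    """Follow string_ids[idx] to its string_data_item and return the text."""
    # each string_ids entry is one uint: string_data_off
    (data_off,) = struct.unpack_from("<I", dex, string_ids_off + 4 * idx)
    _utf16_size, pos = read_uleb128(dex, data_off)
    end = dex.index(b"\x00", pos)  # data is terminated by a 0 byte
    # Assumption: plain UTF-8 is close enough for ASCII-only demo strings.
    return dex[pos:end].decode("utf-8", errors="replace")
```

Here string_ids_off is the value read from the header, and idx is any string_ids index found in the other sections.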


0x04 Type_ids

The type_ids area indexes all data types in the Dex file, including class types, array types, and primitive types. The element format of the section is type_id_item, and the structure is described as follows:

uint 32-bit unsigned int, little-endian

struct type_id_item
{
    uint descriptor_idx;  //-->string_ids
}

The descriptor_idx value in type_id_item is an index into string_ids; the string at that index is the descriptor of this type.
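
To show what such descriptor strings look like, here is a small sketch that turns one into Java-like text. It assumes the descriptor syntax from the official document (single letters for primitive types, L...; for classes, a leading [ per array dimension); describe_type is a name invented for this example.

```python
# Single-letter descriptors for void and the primitive types.
PRIMITIVES = {
    "V": "void", "Z": "boolean", "B": "byte", "S": "short", "C": "char",
    "I": "int", "J": "long", "F": "float", "D": "double",
}


def describe_type(descriptor: str) -> str:
    """Turn a descriptor such as '[Ljava/lang/String;' into Java-like text."""
    dims = len(descriptor) - len(descriptor.lstrip("["))  # array dimensions
    base = descriptor[dims:]
    if base.startswith("L") and base.endswith(";"):
        name = base[1:-1].replace("/", ".")  # class type
    else:
        name = PRIMITIVES[base]  # primitive type or void
    return name + "[]" * dims
```

For instance, describe_type("[[I") gives "int[][]", and describe_type("Ljava/lang/String;") gives "java.lang.String".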


0x05 Proto_ids

Proto stands for method prototype, that is, the prototype of a method in the Java language. The elements of proto_ids are proto_id_item, structured as follows:


uint 32-bit unsigned int, little-endian 

struct proto_id_item
{
    uint shorty_idx;        //-->string_ids
    uint return_type_idx;    //-->type_ids
    uint parameters_off;
}


    • shorty_idx: like the entries of type_ids, its value is a string_ids index; the string it selects is a short-form descriptor of the method prototype.

    • return_type_idx: its value is a type_ids index that gives the return type of the method prototype.

    • parameters_off: an offset to the parameter list of the method prototype; if the method has no parameters, the value is 0. The parameter list has the type_list format, which is described below.
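
A sketch of reading one proto_id_item follows. It assumes the 12-byte layout above; the shorty handling relies on the official document's rule that the first character of a shorty string is the return type and the remaining characters are the parameter types. Function names are illustrative.

```python
import struct

# shorty_idx, return_type_idx, parameters_off: three little-endian uints
PROTO_ID_ITEM = struct.Struct("<III")


def read_proto_id(dex: bytes, proto_ids_off: int, idx: int) -> dict:
    """Unpack proto_ids[idx]."""
    shorty_idx, return_type_idx, parameters_off = PROTO_ID_ITEM.unpack_from(
        dex, proto_ids_off + PROTO_ID_ITEM.size * idx
    )
    return {
        "shorty_idx": shorty_idx,            # index into string_ids
        "return_type_idx": return_type_idx,  # index into type_ids
        "parameters_off": parameters_off,    # 0 means "no parameters"
    }


def split_shorty(shorty: str) -> tuple:
    """Split a shorty string into (return type, list of parameter types)."""
    return shorty[0], list(shorty[1:])
```

A shorty such as "VL" therefore reads as: returns void, takes one reference-type parameter.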

0x06 Field_ids

The field_ids area lists all the fields referenced by the Dex file. The element format of the section is field_id_item, with the following structure:


ushort 16-bit unsigned int, little-endian 
uint   32-bit unsigned int, little-endian 

struct field_id_item
{
    ushort class_idx;  //-->type_ids
    ushort type_idx;   //-->type_ids
    uint   name_idx;   //-->string_ids
}


    • class_idx: the class type the field belongs to; the value is an index into type_ids and must point to a class type.

    • type_idx: the type of this field; the value is also an index into type_ids.

    • name_idx: the name of this field; the value is an index into string_ids.

0x07 Method_ids

method_ids is the last entry in the index area and describes all the methods in the Dex file. Its element format is method_id_item, and the structure is similar to that of field_ids:


ushort 16-bit unsigned int, little-endian 
uint   32-bit unsigned int, little-endian 

struct method_id_item
{
    ushort class_idx;  //-->type_ids
    ushort proto_idx;  //-->proto_ids
    uint   name_idx;   //-->string_ids
}


    • class_idx: the class type the method belongs to; the value is an index into type_ids and must point to a class type. 16-bit indexes like this ushort are related to the familiar rule that a single Dex can reference at most 65,536 methods: the Dalvik invoke instructions encode the method index in 16 bits, so an app with more methods must be split across multiple Dex files.

    • proto_idx: the prototype of the method; the value is an index into proto_ids.

    • name_idx: the name of the method; the value is an index into string_ids.
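
Because field_id_item and method_id_item share the same 8-byte shape (two ushort indexes followed by one uint index), a single struct format can read both. The sketch below assumes that layout; the function names are illustrative.

```python
import struct

# Two ushort indexes followed by one uint index: 8 bytes, little-endian.
ID_ITEM = struct.Struct("<HHI")


def read_field_id(dex: bytes, field_ids_off: int, idx: int) -> dict:
    """Unpack field_ids[idx]."""
    class_idx, type_idx, name_idx = ID_ITEM.unpack_from(
        dex, field_ids_off + ID_ITEM.size * idx
    )
    return {"class_idx": class_idx, "type_idx": type_idx, "name_idx": name_idx}


def read_method_id(dex: bytes, method_ids_off: int, idx: int) -> dict:
    """Unpack method_ids[idx]; the middle field indexes proto_ids, not type_ids."""
    class_idx, proto_idx, name_idx = ID_ITEM.unpack_from(
        dex, method_ids_off + ID_ITEM.size * idx
    )
    return {"class_idx": class_idx, "proto_idx": proto_idx, "name_idx": name_idx}
```

The returned indexes still have to be resolved through type_ids, proto_ids, and string_ids to get readable names.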

0x08 Class_defs

The class_defs section mainly holds the class definitions. Its structure is complex, with one layer nested inside another, and it made me a little dizzy. In the 010 structure view, just looking at it is dizzying, never mind parsing it.

Class_def_item

The class_def_item structure is described as follows:

uint   32-bit unsigned int, little-endian

struct class_def_item
{
    uint class_idx;          //-->type_ids
    uint access_flags;
    uint superclass_idx;     //-->type_ids
    uint interfaces_off;     //-->type_list
    uint source_file_idx;    //-->string_ids
    uint annotations_off;    //-->annotations_directory_item
    uint class_data_off;     //-->class_data_item
    uint static_values_off;  //-->encoded_array_item
}
  • class_idx: the class being defined; the value is an index into type_ids. It must be a class type and cannot be an array type or a primitive type.
  • access_flags: the access flags of the class, such as public, final, abstract, and so on. The "access_flags definitions" section of dex-format.html has the details.
  • superclass_idx: the type of the superclass; the value has the same form as class_idx.
  • interfaces_off: an offset that points to the interfaces of the class; the data structure it points to is type_list. If the class has no interfaces, the value is 0.
  • source_file_idx: the source file information; the value is an index into string_ids. If this information is missing, the value is NO_INDEX = 0xFFFFFFFF.
  • annotations_off: an offset that points to the annotations of the class, located in the data area, in the format annotations_directory_item. If there are none, the value is 0.
  • class_data_off: an offset that points to the data used by the class, located in the data area, in the format class_data_item. If there is none, the value is 0. This structure has a lot of content: it describes the fields of the class, its methods, the code executed by each method, and more. class_data_item is described later.
  • static_values_off: an offset that points to a list in the data area, in the format encoded_array_item. If there is none, the value is 0.
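
Reading one class_def_item could be sketched as follows. The field names follow the official document (interfaces_off, static_values_off), the function name is illustrative, and mapping NO_INDEX to None is just a convenience chosen for this example.

```python
import struct

NO_INDEX = 0xFFFFFFFF
CLASS_DEF_ITEM = struct.Struct("<8I")  # eight little-endian uints, 32 bytes
CLASS_DEF_FIELDS = (
    "class_idx", "access_flags", "superclass_idx", "interfaces_off",
    "source_file_idx", "annotations_off", "class_data_off",
    "static_values_off",
)


def read_class_def(dex: bytes, class_defs_off: int, idx: int) -> dict:
    """Unpack class_defs[idx]; missing indexes become None."""
    values = CLASS_DEF_ITEM.unpack_from(
        dex, class_defs_off + CLASS_DEF_ITEM.size * idx
    )
    item = dict(zip(CLASS_DEF_FIELDS, values))
    # Index fields use NO_INDEX for "missing"; offset fields use 0 instead.
    for key in ("superclass_idx", "source_file_idx"):
        if item[key] == NO_INDEX:
            item[key] = None
    return item
```

The offset fields are left as raw numbers, so a caller would check them against 0 before following them into the data area.
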

Type_list

type_list lives in the data area; class_def_item->interfaces_off points to it. The data structure is as follows:


ushort 16-bit unsigned int, little-endian
uint   32-bit unsigned int, little-endian

struct type_list
{
    uint       size;
    type_item  list[size];
}

struct type_item
{
    ushort type_idx;   //-->type_ids
}


    • size: the number of types in the list

    • type_idx: an index into type_ids
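
Reading a type_list is short enough to sketch directly. The function below assumes the layout above and treats an offset of 0 as an empty list, which matches how interfaces_off and parameters_off use 0 for "none"; the name read_type_list is illustrative.

```python
import struct


def read_type_list(dex: bytes, off: int) -> list:
    """Return the type_ids indexes stored in the type_list at off."""
    if off == 0:  # 0 means the list is absent
        return []
    (size,) = struct.unpack_from("<I", dex, off)
    # size ushort entries follow the uint count
    return list(struct.unpack_from("<%dH" % size, dex, off + 4))
```

Each returned number is a type_ids index, so it still needs one more lookup to become a descriptor string.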

Annotations_directory_item

class_def_item->annotations_off points into the data area, to a structure that describes the annotation-related data. The data structure is as follows:

uint   32-bit unsigned int, little-endian

struct annotations_directory_item
{
    uint class_annotations_off;        //-->annotation_set_item
    uint fields_size;
    uint annotated_methods_size;
    uint annotated_parameters_size;
    
    field_annotation field_annotations[fields_size];
    method_annotation method_annotations[annotated_methods_size];
    parameter_annotation parameter_annotations[annotated_parameters_size];
}

struct field_annotation
{
    uint field_idx;
    uint annotations_off;    //-->annotation_set_item
}

struct method_annotation
{
    uint method_idx;
    uint annotations_off;    //-->annotation_set_item
}

struct parameter_annotation
{
    uint method_idx;
    uint annotations_off;    //-->annotation_set_ref_list
}
    • class_annotations_off: an offset that points to an annotation_set_item; the details can be found in dex-format.html.

    • fields_size: the number of annotated fields

    • annotated_methods_size: the number of annotated methods

    • annotated_parameters_size: the number of annotated method parameter lists

Class_data_item

class_data_off points to a class_data_item structure in the data area. class_data_item contains the various data used by this class; its structure is as follows:

uleb128 unsigned LEB128, variable length

struct class_data_item
{
    uleb128 static_fields_size;
    uleb128 instance_fields_size;
    uleb128 direct_methods_size;
    uleb128 virtual_methods_size;

    encoded_field  static_fields[static_fields_size];
    encoded_field  instance_fields[instance_fields_size];
    encoded_method direct_methods[direct_methods_size];
    encoded_method virtual_methods[virtual_methods_size];
}

struct encoded_field
{
    uleb128 field_idx_diff; 
    uleb128 access_flags;  
}

struct encoded_method
{
    uleb128 method_idx_diff; 
    uleb128 access_flags; 
    uleb128 code_off;
}

Class_data_item

    • static_fields_size: the number of static fields

    • instance_fields_size: the number of instance fields

    • direct_methods_size: the number of direct methods

    • virtual_methods_size: the number of virtual methods

The fields of encoded_method are described below.


Encoded_method

    • method_idx_diff: the prefix method_idx indicates that the value is an index into method_ids, and the suffix _diff indicates that it is stored as a difference from the method_idx of the previous element in the encoded_method[] array (the first element of each array stores the index itself). encoded_field->field_idx_diff works the same way; the Hello.dex I compiled has no fields in its class, so I did not look at it closely. See the official dex-format.html for details.

    • access_flags: the access flags, such as public, private, static, final, and so on.

    • code_off: an offset into the data area that points to the code of this method (0 if the method is abstract or native). The structure it points to is code_item, which has nearly 10 elements.
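
The _diff encoding is easier to follow in code. The sketch below walks a class_data_item and converts each field_idx_diff or method_idx_diff into an absolute index by keeping a running sum that restarts for each of the four arrays. It assumes the structures above; the function names and the tuple-based result are choices made for this example.

```python
def read_uleb128(buf: bytes, pos: int) -> tuple:
    """Decode one unsigned LEB128 value; return (value, next_position)."""
    result = 0
    for i in range(5):
        byte = buf[pos + i]
        result |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            return result, pos + i + 1
    raise ValueError("uleb128 longer than 5 bytes")


def read_class_data(dex: bytes, off: int) -> dict:
    """Parse class_data_item at off; *_idx_diff values become absolute indexes."""
    counts = []
    for _ in range(4):  # the four *_size fields
        value, off = read_uleb128(dex, off)
        counts.append(value)
    # (array name, number of uleb128 fields per entry)
    layout = (("static_fields", 2), ("instance_fields", 2),
              ("direct_methods", 3), ("virtual_methods", 3))
    result = {}
    for (name, width), count in zip(layout, counts):
        entries, index = [], 0  # the running index restarts for each array
        for _ in range(count):
            fields = []
            for _ in range(width):
                value, off = read_uleb128(dex, off)
                fields.append(value)
            index += fields[0]  # add the diff to the previous index
            entries.append((index, *fields[1:]))
        result[name] = entries
    return result
```

Fields come back as (field_idx, access_flags) and methods as (method_idx, access_flags, code_off).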

Code_item

The code_item structure describes the concrete implementation of a method, and its structure is as follows:

struct code_item 
{
    ushort                      registers_size;
    ushort                      ins_size;
    ushort                      outs_size;
    ushort                      tries_size;
    uint                        debug_info_off;
    uint                        insns_size;
    ushort                      insns[insns_size]; 
    ushort                      padding;            // optional
    try_item                    tries[tries_size];  // optional
    encoded_catch_handler_list  handlers;           // optional
}

The last 3 fields are optional; whether they are present depends on the specific code.

    • registers_size: the number of registers used by this code.

    • ins_size: the number of words of incoming arguments to the method.

    • outs_size: the number of words of outgoing argument space this code needs to call other methods.

    • tries_size: the number of try_item structures.

    • debug_info_off: an offset to the debug information for this code, which is a debug_info_item structure.

    • insns_size: the size of the instruction list, in 16-bit units. insns is short for instructions.

    • padding: two bytes with the value 0, used for alignment. It is present only if tries_size is non-zero and insns_size is odd.

    • tries and handlers: used for exception handling; in Java, the familiar syntax is try catch. Both are present only if tries_size is non-zero.
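
The optional padding is the part that is easiest to get wrong, so here is a sketch of reading the fixed part of a code_item, the instruction list, and the tries array. The try_item layout used here (uint start_addr, ushort insn_count, ushort handler_off) is taken from the official document, since it is not described in this article; handlers are not parsed, and the function name is illustrative.

```python
import struct

# registers_size, ins_size, outs_size, tries_size, debug_info_off, insns_size
CODE_ITEM_HEAD = struct.Struct("<4H2I")  # 16 bytes before insns
TRY_ITEM = struct.Struct("<IHH")         # start_addr, insn_count, handler_off


def read_code_item(dex: bytes, off: int) -> dict:
    """Parse code_item at off up to (not including) the handler list."""
    (registers_size, ins_size, outs_size, tries_size,
     debug_info_off, insns_size) = CODE_ITEM_HEAD.unpack_from(dex, off)
    insns_off = off + CODE_ITEM_HEAD.size
    insns = struct.unpack_from("<%dH" % insns_size, dex, insns_off)
    tries_off = insns_off + 2 * insns_size
    # padding exists only when there are tries and insns_size is odd
    has_padding = bool(tries_size) and insns_size % 2 == 1
    if has_padding:
        tries_off += 2  # keeps the try_item array 4-byte aligned
    tries = [TRY_ITEM.unpack_from(dex, tries_off + TRY_ITEM.size * i)
             for i in range(tries_size)]
    return {
        "registers_size": registers_size, "ins_size": ins_size,
        "outs_size": outs_size, "debug_info_off": debug_info_off,
        "insns": insns, "has_padding": has_padding, "tries": tries,
    }
```

The off argument would normally be the code_off value taken from an encoded_method.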

Encoded_array_item

The class_def_item->static_values_off offset points to this data.


uleb128  unsigned LEB128, variable length

struct encoded_array_item
{
    encoded_array value;
}

struct encoded_array
{    
    uleb128 size;
    encoded_value values[size];
}


    • size: the number of encoded_value elements

    • encoded_value: I didn't analyze it.

0x09 map_list

Most of the items in map_list match the corresponding descriptions in the header: both describe the offset and size of each area. map_list is more comprehensive, though, and also covers header_item, type_list, string_data_item, debug_info_item, and others.


In 010 the section appears as map_list; its data structure is:

ushort 16-bit unsigned int, little-endian
uint   32-bit unsigned int, little-endian

struct map_list 
{
    uint     size;
    map_item list[size]; 
}

struct map_item 
{
    ushort type; 
    ushort unused; 
    uint   size; 
    uint   offset;
}

map_list starts with a uint size that gives the number of map_item entries, followed by that many map_item structures. map_item has 4 fields: type is the kind of item this entry describes, one of the type codes defined in the Dalvik Executable Format document; size is the number of items of that type; offset is the offset from the start of the file to the first such item; unused only aligns the bytes and has no practical use.
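
Walking map_list is a convenient way to get an overview of a whole file. The sketch below assumes the layout above; the type codes in TYPE_NAMES are a subset of the table in the official document, and the function name is illustrative.

```python
import struct

MAP_ITEM = struct.Struct("<HHII")  # type, unused, size, offset (12 bytes)

# A subset of the type codes listed in the official dex-format document.
TYPE_NAMES = {
    0x0000: "header_item", 0x0001: "string_id_item", 0x0002: "type_id_item",
    0x0003: "proto_id_item", 0x0004: "field_id_item", 0x0005: "method_id_item",
    0x0006: "class_def_item", 0x1000: "map_list", 0x1001: "type_list",
    0x2000: "class_data_item", 0x2001: "code_item",
    0x2002: "string_data_item", 0x2003: "debug_info_item",
}


def read_map_list(dex: bytes, map_off: int) -> list:
    """Return (type name, item count, offset) for every map_item."""
    (size,) = struct.unpack_from("<I", dex, map_off)
    entries = []
    for i in range(size):
        type_code, _unused, count, offset = MAP_ITEM.unpack_from(
            dex, map_off + 4 + MAP_ITEM.size * i
        )
        # unknown codes are shown in hex rather than guessed at
        entries.append((TYPE_NAMES.get(type_code, hex(type_code)), count, offset))
    return entries
```

Here map_off is the value read from the header, so the output can be compared directly with what 010 shows for the same file.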

