Array
An array type is described by a DIE that defines whether the data is stored in column-major order (as in Fortran) or in row-major order (as in C or C++). The index of the array is represented by a subrange type, which gives the lower and upper bounds of each dimension. This allows DWARF to describe both C-style arrays, which always use 0 as the lowest index, and arrays in Pascal or Ada, which can use any value for the lower and upper bounds.
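To make the role of the ordering flag and the subrange bounds concrete, here is a minimal sketch of how a debugger might turn an index tuple into a byte offset. The function name `element_offset` and its parameters are illustrative assumptions, not part of DWARF or of any debugger's API.

```python
def element_offset(indices, bounds, elem_size, row_major=True):
    """Byte offset of one array element.

    bounds holds one inclusive (lower, upper) pair per dimension,
    as a subrange type would describe them.
    """
    extents = [upper - lower + 1 for lower, upper in bounds]
    zero_based = [i - lower for i, (lower, _upper) in zip(indices, bounds)]
    if not row_major:
        # Column-major: the first index varies fastest, so walk the
        # dimensions in reverse order.
        extents.reverse()
        zero_based.reverse()
    linear = 0
    for idx, extent in zip(zero_based, extents):
        linear = linear * extent + idx
    return linear * elem_size


# C: int a[3][4]; element a[1][2], 4-byte elements, row-major.
print(element_offset((1, 2), [(0, 2), (0, 3)], 4))                   # 24
# Fortran-style A(1:3, 1:4); element A(2, 3), column-major.
print(element_offset((2, 3), [(1, 3), (1, 4)], 4, row_major=False))  # 28
```

The same routine handles Pascal or Ada arrays with arbitrary lower bounds, because the lower bound is subtracted before the offset is computed.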
Structures, classes, unions, and interfaces
Most languages allow the programmer to group data together into structures (called struct in C and C++, class in C++, and record in Pascal). Each component of a structure usually has a unique name and may have a different type, and each occupies its own space. C and C++ have the union, and Pascal has the variant record; these are similar to a structure, except that the components occupy the same memory locations. The Java interface has a subset of the properties of a C++ class, because it may only have abstract methods and constant data members.
Although each language has its own terminology (C++ calls the components of a class members, while Pascal calls them fields), the underlying organization can be described in DWARF. True to its heritage, DWARF uses the C/C++ terminology and has DIEs that describe struct, union, class, and interface. We describe the class DIE here, but the others have essentially the same organization.
The DIE for a class is the parent of the DIEs that describe each of the class's members. Each class has a name and possibly other attributes. If the size of an instance is known at compile time, the class DIE has a byte size attribute. Each member description looks very much like the description of a simple variable, although there may be some additional attributes. For example, C++ allows the programmer to specify whether a member is public, private, or protected. This is described by the accessibility attribute.
C and C++ allow bit fields, which are not simple variables, as class members. These are described by a bit offset from the start of the class instance to the leftmost bit of the bit field, and a bit size that says how many bits the member occupies.
Variable
Variables are usually quite simple. A variable has a name that represents a chunk of memory (or a register) that can contain some kind of value. The kind of value the variable can contain, and any restrictions on how it can be modified (for example, whether it is const), are described by the type of the variable.
What distinguishes one variable from another is where its value is stored and its scope. The scope of a variable defines where the variable is known within the program and is, to some degree, determined by where the variable is declared. In C, variables declared within a function or block have function or block scope. Variables declared outside a function have either global or file scope. This allows different variables with the same name to be defined in different files without conflict. It also allows different functions or compilation units to reference the same variable. DWARF records where the variable is declared in the source file with a (file, line, column) triplet.
DWARF splits variables into three categories: constants, parameters, and variables. A constant is used for languages that have true named constants as part of the language, such as Ada parameters. (C does not have constants as part of the language. Declaring a variable const only says that you cannot modify the variable without using an explicit cast.) A formal parameter represents a value passed to a function. We return to this topic later.
Some languages, such as C or C++ (but not Pascal), allow a variable to be declared without defining it. This implies that there should be a real definition of the variable somewhere else, ideally somewhere the compiler or debugger can find it. A DIE describing a variable declaration provides a description of the variable without telling the debugger where it is.
Most variables have a location attribute that describes where the variable is stored. In the simplest case, a variable is stored in memory and has a fixed address [1]. However, many variables, such as those declared within a C function, are dynamically allocated, and locating them requires some (usually simple) computation. For example, a local variable may be allocated on the stack, and locating it may be as simple as adding a fixed offset to a frame pointer. In other cases, the variable may be stored in a register. Other variables may require somewhat more complicated computations to locate the data. A variable that is a member of a C++ class may require more complex computations to determine the location of the base class within a derived class.
Location expressions
DWARF provides a very general scheme to describe how to locate the data represented by a variable. A DWARF location expression contains a sequence of operations that tell the debugger how to locate the data. Figure 7 shows the DIEs for three variables named a, b, and c. Variable a has a fixed location in memory, variable b is in register 0, and variable c is at offset -12 within the current function's stack frame. Although a is declared first, its DIE is generated later, after all functions. The actual address of a is filled in by the linker.
[1] Strictly speaking, this may not be a fixed address but a fixed offset from the address at which the executable is loaded. The loader relocates references to addresses within an executable image, so that at run time the location attribute contains the actual memory address. In an object file, the location attribute is the offset, along with a suitable relocation table entry.
Figure 7. DWARF description of variables a, b, and c
A DWARF location expression can contain a sequence of operators and values that are evaluated by a simple stack machine. This can be an arbitrarily complex computation, with a wide range of arithmetic operations, tests and branches within the expression, calls to evaluate other location expressions, and accesses to the processor's memory or registers. There are even operations to describe data that is split up and stored in different locations, such as a structure in which some data is stored in memory and some is held in registers.
Although this great flexibility is seldom used in practice, location expressions should allow the location of a variable's data to be described no matter how complex the language definition or how clever the compiler's optimizations.
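The stack machine can be illustrated with a small evaluator. This is a simplified sketch under stated assumptions: operations are written as symbolic tuples rather than decoded from bytes, only a handful of real DWARF operations are handled (DW_OP_addr, DW_OP_fbreg, DW_OP_regx, DW_OP_bregx, DW_OP_plus_uconst, DW_OP_plus, DW_OP_deref), and the function name, the `frame_base`, `registers`, and `memory` arguments, and all addresses are hypothetical.

```python
def evaluate_location(ops, frame_base=0, registers=None, memory=None):
    """Evaluate a symbolic location expression.

    Returns ("register", n) when the value lives in register n,
    otherwise ("memory", address).
    """
    registers = registers or {}
    memory = memory or {}
    stack = []
    for op, *args in ops:
        if op == "DW_OP_addr":            # push a fixed address
            stack.append(args[0])
        elif op == "DW_OP_fbreg":         # frame base + signed offset
            stack.append(frame_base + args[0])
        elif op == "DW_OP_bregx":         # register contents + signed offset
            reg, offset = args
            stack.append(registers[reg] + offset)
        elif op == "DW_OP_plus_uconst":   # add an unsigned constant to the top
            stack.append(stack.pop() + args[0])
        elif op == "DW_OP_plus":          # add the two top entries
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "DW_OP_deref":         # replace an address by its contents
            stack.append(memory[stack.pop()])
        elif op == "DW_OP_regx":          # the value is in a register
            return ("register", args[0])
        else:
            raise ValueError(f"unsupported operation: {op}")
    return ("memory", stack[-1])


# Three variables in the spirit of Figure 7 (addresses are made up):
print(evaluate_location([("DW_OP_addr", 0x8000)]))                  # ('memory', 32768)
print(evaluate_location([("DW_OP_regx", 0)]))                       # ('register', 0)
print(evaluate_location([("DW_OP_fbreg", -12)], frame_base=0x1000)) # ('memory', 4084)
```

A real evaluator would decode the byte-encoded operations and support the full operation set, including composition of pieces; the sketch only shows the push, compute, and result pattern.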
Describing executable code
Functions and subprograms
DWARF treats functions that return values and subroutines that do not as variations of the same thing. Drifting slightly away from its roots in C terminology, DWARF describes both with a subprogram DIE. This DIE has a name, a source location triplet, and an attribute that indicates whether the subprogram is external, that is, visible outside the current compilation unit.
A subprogram DIE has attributes that give the low and high memory addresses the subprogram occupies if it is contiguous, or a list of memory ranges if the function does not occupy a contiguous set of memory addresses. The low PC address is assumed to be the entry point of the routine unless another one is explicitly specified.
The type of the value that a function returns is given by the type attribute. Subroutines that do not return values (such as C void functions) do not have this attribute. DWARF does not describe the calling convention for a function; that is defined in the Application Binary Interface (ABI) for the particular architecture. There may be attributes that help a debugger locate the subprogram's data or find the current subprogram's caller. The return address attribute is a location expression that specifies where the address of the caller is stored. The frame base attribute is a location expression that computes the address of the stack frame for the function. These are useful because some of the most common compiler optimizations eliminate the instructions that explicitly save the return address or frame pointer.
The subprogram DIE owns the DIEs that describe the subprogram. The parameters that may be passed to a function are represented by variable DIEs that have the variable parameter attribute. If a parameter is optional or has a default value, this is represented by attributes. The DIEs for the parameters are in the same order as the function's argument list, but additional DIEs may be interspersed, for example to define types used by the parameters.
A function may define variables that are local or global. The DIEs for these variables follow the parameter DIEs. Many languages allow nested lexical blocks. These are represented by lexical block DIEs, which in turn may own variable DIEs or nested lexical block DIEs.
Here is a somewhat longer example. Figure 8a shows the source for strndup.c, a function in GCC that duplicates a string. Figure 8b lists the DWARF generated for this file. As in the previous examples, source line information and location attributes are not shown.
Figure 8a. strndup.c source code
In Figure 8b, DIE <2> shows the definition of size_t, which is a typedef of unsigned int. This allows a debugger to display the type of the formal parameter n as size_t while displaying its value as an unsigned integer. DIE <5> describes the function strndup. It has a pointer to its sibling, DIE <10>; all the DIEs that follow are children of the subprogram DIE. The function returns a pointer to char, described in DIE <10>. DIE <5> also describes the subroutine as external and prototyped, and gives the low and high PC values for the routine. The formal parameters and local variables of the routine are described in DIEs <6> to <9>.
Figure 8b. DWARF description of strndup.c
Compilation Unit
Most interesting programs consist of more than a single file. Each source file that makes up a program is compiled independently and then linked together with system libraries to form the program. DWARF calls each separately compiled source file a compilation unit.
The DWARF data for each compilation unit starts with a compilation unit DIE. This DIE contains general information about the compilation, including the directory and name of the source file, the programming language used, a string that identifies the producer of the DWARF data, and offsets into the DWARF data sections that help locate the line number and macro information.
If the compilation unit is contiguous (that is, it is loaded into memory in one piece), there are values for the low and high memory addresses of the unit. This makes it easier for a debugger to identify which compilation unit created the code at a particular memory address. If the compilation unit is not contiguous, the compiler and linker provide a list of the memory addresses that the code occupies.
The compilation unit DIE is the parent of all the DIEs that describe the compilation unit. Generally, the first DIEs describe data types, followed by global data, and then the functions that make up the source file. The DIEs for variables and functions appear in the same order as the variables and functions appear in the source file.
Data Encoding
Conceptually, the DWARF data that describes a program is a tree. Each DIE may have a sibling and may contain several DIEs. Each DIE has a type (called its tag) and a number of attributes. Each attribute is represented by an attribute type and a value. Unfortunately, this is not a very dense encoding. Without compression, the DWARF data is unwieldy.
DWARF Versions 2 and 3 offer several ways to reduce the size of the data that must be saved with the object file. The first is to "flatten" the tree by saving it in prefix order. Each type of DIE is defined either to have children or not. If a DIE cannot have children, the next DIE is its sibling. If a DIE can have children, the next DIE is its first child. The remaining children are represented as siblings of this first child. In this way, links to sibling or child DIEs can be eliminated. If the compiler writer thinks it might be useful to jump from one DIE to its sibling without stepping through each of its child DIEs (for example, to jump to the next function in a compilation unit), a sibling attribute can be added to the DIE.
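The prefix-order flattening can be undone with a small stack-based walk. The sketch below assumes a simplified flat list in which each entry is a `(tag, has_children)` pair, and in which `None` stands for the null entry that the DWARF standard uses to end a chain of siblings (a detail the paragraph above does not spell out). The function name and the dictionary layout are illustrative choices.

```python
def build_tree(flat):
    """Rebuild a nested DIE tree from a prefix-order list."""
    root = {"tag": "root", "children": []}
    open_parents = [root]                 # DIEs whose child lists are still open
    for entry in flat:
        if entry is None:                 # null entry: current sibling chain ends
            open_parents.pop()
            continue
        tag, has_children = entry
        node = {"tag": tag, "children": []}
        open_parents[-1]["children"].append(node)
        if has_children:                  # the next DIE is this DIE's first child
            open_parents.append(node)
    return root["children"]


flat = [
    ("compile_unit", True),
    ("base_type", False),
    ("subprogram", True),
    ("formal_parameter", False),
    ("variable", False),
    None,                                 # end of the subprogram's children
    ("variable", False),
    None,                                 # end of the compilation unit's children
]
tree = build_tree(flat)
print([child["tag"] for child in tree[0]["children"]])
# ['base_type', 'subprogram', 'variable']
```

No explicit child or sibling links are stored in `flat`; the structure is recovered entirely from the has-children flag and the terminators.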
The second scheme for compressing the data is to use abbreviations. Although DWARF allows great flexibility in which DIEs and attributes are generated, most compilers generate only a limited set of DIEs, all of which have the same set of attributes. Instead of storing the tag value and the attribute-value pairs of a DIE, only an index into a table of abbreviations is stored, followed by the attribute values. Each abbreviation gives the tag value, a flag indicating whether the DIE has children, and a list of attributes with the type of value each one expects. Figure 9 shows the abbreviation used for a DIE in Figure 8b; DIE <6> in Figure 8b is actually encoded as shown [1]. This significantly reduces the amount of data that needs to be saved, at some cost in added complexity.
[1] The encoded entry also includes the file and line values, which are not shown in Figure 8b.
Figure 9. Abbreviation entry and encoded form
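A rough sketch of decoding one DIE through an abbreviation table follows. The abbreviation code is stored as an unsigned LEB128 number, which is how DWARF encodes it; everything else here is an assumption made for illustration: the table contents, the abbreviation number 5, the encoded bytes, and the helper names. Only three attribute forms are handled.

```python
def read_uleb128(data, pos):
    """Decode an unsigned LEB128 number; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def decode_die(data, pos, abbrevs):
    """Decode one DIE: an abbreviation code followed by attribute values."""
    code, pos = read_uleb128(data, pos)
    tag, has_children, attr_specs = abbrevs[code]
    attrs = {}
    for name, form in attr_specs:
        if form == "DW_FORM_string":      # inline NUL-terminated string
            end = data.index(0, pos)
            attrs[name] = data[pos:end].decode("ascii")
            pos = end + 1
        elif form == "DW_FORM_data1":     # one-byte constant
            attrs[name] = data[pos]
            pos += 1
        elif form == "DW_FORM_ref4":      # 4-byte offset to another DIE
            attrs[name] = int.from_bytes(data[pos:pos + 4], "little")
            pos += 4
        else:
            raise ValueError(f"unsupported form: {form}")
    return {"tag": tag, "has_children": has_children, "attrs": attrs}, pos


# Hypothetical abbreviation 5: a formal parameter with four attributes.
ABBREVS = {
    5: ("DW_TAG_formal_parameter", False, [
        ("DW_AT_name", "DW_FORM_string"),
        ("DW_AT_decl_file", "DW_FORM_data1"),
        ("DW_AT_decl_line", "DW_FORM_data1"),
        ("DW_AT_type", "DW_FORM_ref4"),
    ]),
}

encoded = bytes([5]) + b"s\x00" + bytes([1, 41]) + (0x50).to_bytes(4, "little")
die, end = decode_die(encoded, 0, ABBREVS)
print(die["attrs"], end)
```

The encoded DIE takes nine bytes here; the tag and the attribute names and forms are stored once in the abbreviation table rather than with every DIE.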
Less commonly used are the features of DWARF Version 3 that allow references from one compilation unit to DWARF data stored in another compilation unit or in a shared library. Many compilers generate the same abbreviation table and base types for every compilation unit, regardless of whether the compilation unit actually uses all of the abbreviations or types. These can be saved in a shared library and referenced by each compilation unit, rather than being duplicated in each.
Other DWARF data
Line number table
The DWARF line table contains the mapping between the memory addresses that hold the executable code of a program and the source lines that correspond to those addresses. In its simplest form, this can be viewed as a matrix in which one column contains the memory address and another contains the source triplet (file, line, and column) for that address. If you want to set a breakpoint at a particular line, the table gives you the memory address at which to store the breakpoint instruction. Conversely, if your program faults (for example, by using a bad pointer) at some location in memory, you can look up the source line closest to that memory address.
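Both lookups described above can be sketched against a decoded table. The table contents below are invented for illustration (they are not the values from the article's figures), and the function names are assumptions; the column number is omitted for brevity.

```python
import bisect

# Hypothetical decoded line table: (address, file, line), sorted by address.
LINE_TABLE = [
    (0x00, "strndup.c", 41),
    (0x0C, "strndup.c", 42),
    (0x1A, "strndup.c", 44),
    (0x2F, "strndup.c", 47),
]


def address_to_source(table, address):
    """Find the row at or immediately before the given address."""
    starts = [row[0] for row in table]
    i = bisect.bisect_right(starts, address) - 1
    return None if i < 0 else table[i][1:]


def source_to_addresses(table, filename, line):
    """Find every address at which code for a source line begins."""
    return [addr for addr, f, l in table if (f, l) == (filename, line)]


print(address_to_source(LINE_TABLE, 0x1C))               # ('strndup.c', 44)
print(source_to_addresses(LINE_TABLE, "strndup.c", 42))  # [12]
```

The second function returns a list because optimized code may contain more than one address range for the same source line.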
DWARF extends this with additional columns that convey further information about the program. When a compiler optimizes a program, it may move instructions around or remove them. The code for a given source statement may not be stored as one sequence of machine instructions, but may be scattered and interleaved with the instructions of other nearby statements. It may be useful to identify the end of the code that represents the prolog of a function, or the beginning of the epilog, so that the debugger can stop after all of the arguments to a function have been loaded or before the function returns. Some processors can execute more than one instruction set, so another column indicates which set is stored at the specified machine location.
As you might imagine, if this table were stored with one row per machine instruction, it would be huge. DWARF compresses the data by encoding it as a sequence of instructions called a line number program [1]. These instructions are interpreted by a simple finite state machine to recreate the complete line number table.
The finite state machine is initialized with a set of default values. Each row in the line number table is generated by executing one or more opcodes of the line number program. The opcodes are generally quite simple: for example, add a value to the machine address or to the line number, set the column number, or set a flag indicating that the memory address represents the start of a source statement, the end of the function prolog, or the start of the function epilog. A set of special opcodes combines the most common operations (incrementing the memory address and either incrementing or decrementing the source line number) into a single opcode.
Finally, if a row of the line number table has the same source triplet as the previous row, no instructions are generated for that row in the line number program. Figure 10 lists the line number program for strndup.c. Note that only the machine addresses that represent the first instruction of a statement are stored. The compiler did not identify the basic blocks in this code, or the end of the prolog or the start of the epilog of the function. In the line number program, this table is encoded in just 31 bytes.
[1] Calling this a line number program is something of a misnomer. The program describes much more than line numbers, such as the instruction set, the beginning of basic blocks, and the end of the function prolog.
Figure 10. Line number program for strndup.c
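The following sketch shows the idea of the state machine, including how a special opcode packs an address advance and a line advance into one byte. It is a simplification: standard opcodes are written as symbolic tuples instead of bytes, only the address and line registers are modelled, and the header parameters (`line_base=-5`, `line_range=14`, `opcode_base=13`) are example values assumed for the illustration. The special-opcode arithmetic follows the formula in the DWARF standard.

```python
def special_opcode(line_delta, addr_delta,
                   line_base=-5, line_range=14, opcode_base=13):
    """Pack a line advance and an address advance into one opcode."""
    assert line_base <= line_delta < line_base + line_range
    opcode = (line_delta - line_base) + line_range * addr_delta + opcode_base
    assert opcode <= 255, "does not fit; use standard opcodes instead"
    return opcode


def run_line_program(ops, line_base=-5, line_range=14, opcode_base=13,
                     min_inst_length=1):
    """Interpret a simplified line number program; return (address, line) rows."""
    address, line = 0, 1                      # default initial state
    rows = []
    for op in ops:
        if isinstance(op, int):               # special opcode: advance both, emit a row
            adjusted = op - opcode_base
            address += (adjusted // line_range) * min_inst_length
            line += line_base + (adjusted % line_range)
            rows.append((address, line))
        elif op[0] == "advance_pc":
            address += op[1] * min_inst_length
        elif op[0] == "advance_line":
            line += op[1]
        elif op[0] == "copy":                 # emit a row from the current state
            rows.append((address, line))
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return rows


program = [
    ("advance_line", 40),
    ("copy",),
    special_opcode(line_delta=1, addr_delta=3),
    special_opcode(line_delta=2, addr_delta=6),
]
print(run_line_program(program))   # [(0, 41), (3, 42), (9, 44)]
```

Each special opcode is a single byte yet produces a complete new row, which is why the encoded program is so much smaller than the table it expands to.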
Macro information
Most debuggers have a very difficult time displaying and debugging code that contains macros. The user sees the original source file, with the macros, while the code corresponds to whatever the macros generated.
DWARF includes a description of the macros defined in the program. This is quite rudimentary information, but a debugger can use it to display the value of a macro or possibly to translate the macro into the corresponding source language.
Call frame information
Every processor has a certain way of calling functions and passing arguments, usually defined in the ABI. In the simplest case, this is the same for every function, and the debugger knows exactly how to find the argument values and the return address of the function.
For some processors, there may be different calling sequences depending on how the function is written, for example if there are more than a certain number of arguments. There may also be different calling sequences depending on the operating system. Compilers try to optimize the calling sequence to make code both smaller and faster. One common optimization is for a simple function that does not call any others (a leaf function) to use its caller's stack frame instead of creating its own. Another optimization may be to eliminate the register that points to the current call frame. Some registers may be preserved across the call while others are not. While it may be possible for the debugger to puzzle out all the possible permutations of calling sequences and optimizations, doing so is both tedious and error-prone. After a small change in the optimizations, the debugger may no longer be able to walk the stack to the calling function.
The DWARF call frame information (CFI) provides the debugger with enough information about how a function is called that it can locate each of the arguments to the function, locate the current call frame, and locate the call frame of the calling function. The debugger uses this information to "unwind the stack", locating the previous function, the location where the function was called, and the values passed.
Like the line number table, the CFI is encoded as a sequence of instructions that are interpreted to generate a table. There is one row in this table for each address that contains code. The first column contains the machine address, and the subsequent columns contain the values of the machine registers when the instruction at that address is executed. Like the line number table, this table would be huge if it were actually built. Luckily, very little changes between two consecutive machine instructions, so the CFI encoding is quite compact.
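As a rough sketch of how a decoded CFI table is used, the example below stores one row per address at which the unwinding rules change, and uses it to recover the caller's frame and return address. The rows imitate a typical i386-style prolog but are invented for illustration; the row layout, register names, memory contents, and function name are all assumptions, and a real table would also carry rules for the other saved registers.

```python
import bisect

# Hypothetical rows: (start address, CFA register, CFA offset,
#                     offset of the saved return address relative to the CFA).
CFI_ROWS = [
    (0x00, "esp", 4, -4),   # on entry: return address is on top of the stack
    (0x01, "esp", 8, -4),   # after the frame pointer has been pushed
    (0x03, "ebp", 8, -4),   # after the frame pointer has been set up
]


def unwind(pc, registers, memory, rows=CFI_ROWS):
    """Return (canonical frame address, return address) for the frame at pc."""
    starts = [row[0] for row in rows]
    i = bisect.bisect_right(starts, pc) - 1
    if i < 0:
        raise LookupError("no call frame information for this address")
    _start, cfa_reg, cfa_offset, ra_offset = rows[i]
    cfa = registers[cfa_reg] + cfa_offset
    return cfa, memory[cfa + ra_offset]


# Stopped in the body of the function, with the frame pointer established.
registers = {"esp": 0x0FE0, "ebp": 0x0FF0}
memory = {0x0FF4: 0x4000}          # saved return address
print(unwind(0x10, registers, memory))   # (4088, 16384)
```

Because the rules change at only three addresses, three rows describe the whole function, which mirrors why the encoded CFI stays compact.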
ELF sections
Although DWARF is defined in a way that allows it to be used with any object file format, it is most often used with ELF. Each kind of DWARF data is stored in its own section. The names of these sections all start with ".debug_". For efficiency, most references to DWARF data use an offset from the start of the data for the current compilation unit. This avoids the need to relocate the debugging data, which speeds up program loading and debugging.
The ELF sections and their contents are:
.debug_abbrev    Abbreviations used in the .debug_info section
.debug_aranges   Mapping between memory addresses and compilation units
.debug_frame     Call frame information
.debug_info      The core DWARF data containing DIEs
.debug_line      Line number program
.debug_loc       Location descriptions
.debug_macinfo   Macro descriptions
.debug_pubnames  Lookup table for global objects and functions
.debug_pubtypes  Lookup table for global types
.debug_ranges    Address ranges referenced by DIEs
.debug_str       String table used by .debug_info
Summary
So there you have it: DWARF in a nutshell. Well, not quite a nutshell. The basic concepts of the DWARF debugging information are straightforward. A program is described as a tree with nodes that represent the functions, data, and types in the source, in a compact, language-independent and machine-independent manner. The line table provides the mapping between the executable instructions and the source that generated them. The CFI describes how to unwind the stack.
There is also quite a bit of subtlety in DWARF, given that it needs to express many different nuances for a wide range of programming languages and machine architectures. The future direction of DWARF is to improve the description of optimized code, so that debuggers can better navigate the code generated by advanced compiler optimizations.
The complete DWARF Version 3 standard can be downloaded free of charge from the DWARF website (dwarf.freestandards.org). There is also a mailing list for questions and discussion about DWARF. Instructions for subscribing to the mailing list are on the website.
Acknowledgements
I would like to thank Chris Quenelle of Sun Microsystems and Ron Brender, formerly of HP, for their comments and suggestions on this article. Thanks also to Susan Heimlich for her many editorial suggestions.
Generating DWARF with GCC
Generating DWARF with GCC is very simple. You only need to specify the -g option to generate debugging information. You can use objdump with the -h option to display the ELF sections.
$ gcc -g -c strndup.c
$ objdump -h strndup.o
strndup.o:     file format elf32-i386
Sections:
Idx Name             Size      VMA       LMA       File off  Algn
  0 .text            0000007b  00000000  00000000  00000034  2**2
                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data            00000000  00000000  00000000  000000b0  2**2
                     CONTENTS, ALLOC, LOAD, DATA
  2 .bss             00000000  00000000  00000000  000000b0  2**2
                     ALLOC
  3 .debug_abbrev    00000073  00000000  00000000  000000b0  2**0
                     CONTENTS, READONLY, DEBUGGING
  4 .debug_info      00000118  00000000  00000000  00000123  2**0
                     CONTENTS, RELOC, READONLY, DEBUGGING
  5 .debug_line      00000080  00000000  00000000  0000023b  2**0
                     CONTENTS, RELOC, READONLY, DEBUGGING
  6 .debug_frame     00000034  00000000  00000000  000002bc  2**2
                     CONTENTS, RELOC, READONLY, DEBUGGING
  7 .debug_loc       0000002c  00000000  00000000  000002f0  2**0
                     CONTENTS, READONLY, DEBUGGING
  8 .debug_pubnames  0000001e  00000000  00000000  0000031c  2**0
                     CONTENTS, RELOC, READONLY, DEBUGGING
  9 .debug_aranges   00000020  00000000  00000000  0000033a  2**0
                     CONTENTS, RELOC, READONLY, DEBUGGING
 10 .comment         0000002a  00000000  00000000  0000035a  2**0
                     CONTENTS, READONLY
 11 .note.GNU-stack  00000000  00000000  00000000  00000384  2**0
                     CONTENTS, READONLY
Printing DWARF with readelf
readelf can display and decode the DWARF data in an object file or executable file. The options are:
-w               display all DWARF sections
-w[liaprmfFso]   display specific sections
    l   line table
    i   debug info
    a   abbreviation table
    p   public names
    r   address ranges
    m   macro table
    f   debug frame (encoded)
    F   debug frame (decoded)
    s   string table
    o   location lists
The DWARF listing for all but the smallest programs is quite voluminous, so it is a good idea to redirect the readelf output to a file and then browse the file with less or with an editor such as vi.