C compiler analysis _ 5.2.3 intermediate code generation and optimization _ accessing array elements and struct members through "offset"

Source: Internet
Author: User

Section 5.2.3 Accessing array elements and struct members through "offset"

In the previous section, we introduced access to array elements and struct members, where the address of a memory unit is calculated as "base address + offset". For the array element arr2[i][2], the index i is a variable, so the address must be expressed as "base address + constant offset + non-constant offset"; for the struct member dt.b, the address can be expressed as "base address + constant offset". Next, we use a simple example to illustrate these concepts. In Figure 5.2.9, lines 1 to 14 are a simple C program, lines 16 to 30 are the intermediate code generated by the UCC compiler, lines 33 to 46 are the assembly code generated by UCC, and lines 49 to 58 are the assembly code generated by GCC.

Because arr[i] in the C program contains a non-constant offset, the C compiler has to generate code that calculates the offset and then adds it to the starting address of the array. In assembly code, the instructions used for addressing are quite flexible. For "arr[i] = 30;", the code generated by GCC, shown on lines 50 to 51 of Figure 5.2.9, differs from that generated by UCC. UCC uses the instructions on lines 34 to 38: lines 34 to 35 calculate the non-constant offset, that is, i * 4, by shifting i left by 2 bits; line 36 uses the leal instruction to obtain the starting address of the array arr; line 37 adds the base address and the offset and stores the result in register ecx; line 38 then performs the assignment through register-indirect addressing. The intermediate code "t1: &arr" corresponds to the assembly code "leal arr, %ecx".

We can also see that the C statement "arr[2] = 50;" on line 11 corresponds to the assembly code "movl $50, arr+8". The symbol arr in the assembly code can be regarded as an address constant, and the movl instruction stores the constant 50 into the memory unit at address (arr + 8).


 

Figure 5.2.9 addressing array elements
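The "base address + constant offset + non-constant offset" decomposition can be checked directly in C. The following is a minimal sketch and is not taken from UCC: the helper element_address, the dimensions of arr2, and the layout of struct Data are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct, used only for illustration. */
struct Data {
    int a;
    int b;
};

/* address = base address + constant offset + non-constant offset (in bytes) */
static void *element_address(void *base, size_t const_off, size_t var_off)
{
    assert(base != NULL);
    return (char *)base + const_off + var_off;
}

/* Returns 1 if &arr2[i][2] equals base + constant offset + i * row size. */
static int check_array(int i)
{
    static int arr2[3][4];                         /* assumed dimensions */
    size_t const_off = 2 * sizeof(int);            /* index 2 is known at compile time */
    size_t var_off = (size_t)i * sizeof(arr2[0]);  /* index i is known only at run time */
    return element_address(arr2, const_off, var_off) == (void *)&arr2[i][2];
}

/* Returns 1 if &dt.b equals base + constant offset; no non-constant part. */
static int check_struct(void)
{
    static struct Data dt;
    return element_address(&dt, offsetof(struct Data, b), 0) == (void *)&dt.b;
}
```

Calling check_array with any valid row index, or check_struct, should return 1, since this is exactly the arithmetic the compiler performs for these accesses.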

At the intermediate code level, a symbol object struct symbol (or an object of one of its "subclasses", such as struct variableSymbol) can serve as the destination or a source operand of a three-address code. When the UCC compiler generates intermediate code for an arr node in the abstract syntax tree, it does not consider the node's context. In order to produce an assembly instruction such as the one on line 36, UCC generates the intermediate code "t1: &arr", after which the temporary variable t1 holds the starting address of the array arr. Although the contents of the array may be modified, the address of arr does not change during the array's lifetime, so &arr can be treated as a common subexpression: when we encounter the arr node of another abstract syntax tree on line 11, we can reuse the value in the temporary variable t1. If the register corresponding to t1 is eax, the following assembly code can be generated for the C statement "arr[2] = 50;".

leal arr, %eax; // obtain the address of the array arr

addl $8, %eax; // the offset of arr[2] within the array arr is the constant 8

movl $50, (%eax); // assign through register-indirect addressing

This assembly code implements the semantics required by the C statement "arr[2] = 50;", but it is not very efficient. The single instruction "movl $50, arr+8" on line 40 of Figure 5.2.9 achieves the same effect. With the concepts of base address and offset in hand, we can use the function Offset in the UCC compiler to generate the intermediate code for accessing array elements or struct members, as shown on lines 18 to 20 of Figure 5.2.9; the function AddressOf generates address-of instructions such as "t1: &arr".

Figure 5.2.10 shows the code of the function Offset. When a C programmer accesses an array element or struct member, the parameter addr on line 2 is the base address of the array or struct, and the parameter voff is the non-constant offset (VariableOffset). When the struct member dt.num on line 12 of Figure 5.2.9 is accessed, voff is NULL, because dt.num has a fixed offset within the struct object dt. However, when dt.num[i] is accessed, the offset of the element dt.num[i] within the array dt.num is (i * 4), which is not a constant, so voff is not NULL; likewise, voff is not NULL when arr[i] is accessed. The other parameter on line 2, coff, is the constant offset (ConstOffset). When arr[2] on line 11 of Figure 5.2.9 is accessed, coff is 2 * sizeof(int), that is, 8.

Lines 3 to 8 of Figure 5.2.10 generate the code that adds the base address, the non-constant offset, and the constant offset; the indirect addressing operation is then performed by Deref, which yields the intermediate code for "arr[i] = 30;" shown in Figure 5.2.9. When the C programmer accesses arr[2], t1 in the comment of Figure 5.2.10 corresponds to the parameter addr, voff is NULL, and coff is 8. To generate the assembly code "movl $50, arr+8" of Figure 5.2.9 rather than the three less efficient instructions "leal arr, %eax; addl $8, %eax; movl $50, (%eax);", we call the function CreateOffset to create a new symbol object that represents symbols such as arr[8] at the intermediate code level.

For the code shown in the comments on lines 13 to 14 of Figure 5.2.10, ptr is a pointer to an int[4] array. When a C programmer accesses an array element through (*ptr)[2], the UCC compiler constructs a syntax tree of the form [] ([] ptr 0) 8 during semantic checking in CheckUnaryExpression. When translating this syntax tree, we find that the base address is ptr and the offset is 8. We must add the two and then access the array element through indirect addressing; line 15 calls Deref to do this. If we mistakenly represented the (*ptr)[2] written by the C programmer as the symbol ptr[8] at the intermediate code level, the final assembly code would be "movl $50, ptr+8". Suppose the starting address of the array arr is 10000, the address of the global variable ptr is 20000, and the content of ptr is 10000. In this movl instruction, ptr is the address constant 20000, so the instruction stores the constant 50 into the memory unit at address 20008. According to the semantics of C, however, (*ptr)[2] should access the array element arr[2], whose address is 10008, so "movl $50, ptr+8" is incorrect. Consequently, at the intermediate code level we cannot use the symbol ptr[8] to represent the array element (*ptr)[2]. Instead, code of the form "t5: ptr + 8; *t5 = 60;" is generated. This intermediate code is produced by the function Deref on line 15 of Figure 5.2.10. Deref is short for Dereference; the actual operation is indirect addressing.
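The difference between "the address of ptr plus 8" and "the content of ptr plus 8" can be observed from C itself. The sketch below is illustrative only; the helper names are assumptions, and it simply mirrors the arr/ptr setup described above.

```c
#include <assert.h>
#include <stdint.h>

static int arr[4];
static int (*ptr)[4] = &arr;   /* the content of ptr is the address of arr */

/* Stores through (*ptr)[2]; the value of ptr must be loaded first. */
static void store_through_pointer(int value)
{
    assert(ptr != 0);
    (*ptr)[2] = value;
}

/* Returns 1 if the target is "content of ptr + 8", that is, &arr[2]. */
static int target_is_arr_element(void)
{
    return (uintptr_t)&(*ptr)[2] == (uintptr_t)arr + 2 * sizeof(int);
}

/* Returns 1 if the target is NOT "address of ptr + 8", the wrong reading. */
static int target_is_not_near_ptr(void)
{
    return (uintptr_t)&(*ptr)[2] != (uintptr_t)&ptr + 2 * sizeof(int);
}
```

After store_through_pointer(50), arr[2] holds 50 and the memory next to the variable ptr is untouched, which is why the compiler needs a real indirection here rather than an offset symbol.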


 

Figure 5.2.10 Offset ()

When we access the struct member dt.num, or an array element that has no non-constant offset (for example, arr[2]), we can use CreateOffset on lines 18 to 39 to create a symbol object for it. The parameter base on line 18 is the base address, and coff on line 19 is the constant offset. If coff is 0, as when initializing the local variables d and dt in the comment on line 21 of Figure 5.2.10, the condition on line 22 is true and we can return base directly. However, if we access dt.a, then by the struct definition on lines 4 to 7 of Figure 5.2.9 the offset of dt.a within dt is 0, but dt.a and dt have different types, so we need to create a new symbol object for dt.a instead of sharing the symbol object of dt; in this case the condition on line 22 is false.

Line 25 allocates a symbol object on the heap. Line 30 records the coordinates of the symbol in the C source code, and line 31 sets the flag addressed to 1, indicating that the object has had its address taken (as a result, when array elements and struct members are used as operands of an expression, that expression is no longer treated as a common subexpression; the related concepts were introduced in Section 5.2). Line 32 sets the kind of the symbol object to SK_Offset, line 33 sets its type, and line 34 saves the symbol object corresponding to its base address; the following lines store the constant offset and generate the symbol name, for example "arr[8]".

The parameter base may itself correspond to an array element or struct member, for example dt.num in dt.num[2]. For dt.num, the base address is dt; by the definition on lines 4 to 7 of Figure 5.2.9, the offset of dt.num within the struct object dt is 4, while the offset of the array element dt.num[2] within the array dt.num is 8. At the intermediate code level, we can add the two to obtain the offset of dt.num[2] within the struct object dt, which is 12. Lines 26 to 29 of Figure 5.2.10 perform these operations.
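The folding of nested constant offsets described above can be sketched with a much simplified symbol model. This is not UCC's actual CreateOffset: struct Sym, its fields, the enum, and create_offset are hypothetical names, and the sketch covers only the folding step, not the coff == 0 shortcut or the type handling.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified symbol; the real struct symbol has more fields. */
enum SymKind { SK_VARIABLE, SK_OFFSET };

struct Sym {
    enum SymKind kind;
    struct Sym *base;   /* underlying variable, for SK_OFFSET symbols */
    int offset;         /* constant byte offset from base */
    char name[64];
};

/* Creates a symbol standing for "base + coff". If base is itself an
   offset symbol, the two constant offsets are folded into one. */
static struct Sym *create_offset(struct Sym *base, int coff)
{
    struct Sym *p;

    assert(base != NULL);
    if (base->kind == SK_OFFSET) {
        coff += base->offset;   /* e.g. 4 (dt.num) + 8 ([2]) = 12 */
        base = base->base;      /* make the offset relative to dt itself */
    }
    p = calloc(1, sizeof *p);
    if (p == NULL)
        return NULL;
    p->kind = SK_OFFSET;
    p->base = base;
    p->offset = coff;
    snprintf(p->name, sizeof p->name, "%s[%d]", base->name, coff);
    return p;
}
```

With a variable symbol named dt, calling create_offset(&dt, 4) for dt.num and then create_offset on that result with 8 for the element [2] gives a symbol whose base is dt and whose offset is 12.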

The function Deref on line 40 of Figure 5.2.10 mainly generates indirect addressing instructions such as "t3: *t2". Here t2 holds an address, *t2 denotes the content of the memory unit at that address, that content is stored in the temporary variable t3, and the symbol t3 is returned as the result of the indirect addressing. Of course, if the parameter addr on line 40 is like t1 on line 44, where t1 was created by the intermediate code "t1: &arr", the indirect addressing operation *t1 can be simplified to an access to arr.

The function AddressOf in Figure 5.2.10 generates address-of instructions such as "t: &num"; here num must be an lvalue (with an address visible to the C programmer). If the parameter p is the result t3 of an indirect addressing whose instruction is "t3: *t2", then the address-of operation &t3 can be simplified to t2. The if statement on lines 52 to 57 checks for this case and returns t2 directly. When the address of num is taken, UCC calls the function TrackValueChange on line 61 to invalidate the common subexpressions that use num as an operand. UCC adopts this conservative policy to avoid more complex processing such as alias analysis. This does affect the quality of the generated code; UCC does relatively little in the way of optimization. Since the address of num does not change during its lifetime, the result of taking the address of num can itself be used as a common subexpression, and the call to TryAddValue on line 63 serves this purpose.
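Both simplifications rest on identities that already hold in C: *(&arr) denotes arr itself, and &(*t2) is t2 again. A minimal sketch follows; the function names are assumptions chosen for illustration and do not appear in UCC.

```c
#include <assert.h>

static int arr[4];

/* *(&arr) names the same object as arr, so a Deref applied to
   "t1: &arr" can be replaced by a direct access to arr. */
static int deref_of_address_is_object(void)
{
    int (*t1)[4] = &arr;   /* t1: &arr */
    return (void *)*t1 == (void *)arr && sizeof(*t1) == sizeof(arr);
}

/* &(*t2) yields t2 again, so AddressOf applied to the result of
   "t3: *t2" can simply return t2. */
static int address_of_deref_is_pointer(int *t2)
{
    assert(t2 != 0);
    return &*t2 == t2;
}
```

Both functions return 1 for any valid pointer, which is what lets a compiler cancel an adjacent pair of address-of and dereference operations.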

For a global or static variable number, the symbol number in C code can be understood as "the content of the memory unit corresponding to number". When number appears on the right of the assignment operator, the memory unit is read; when number appears on the left of the assignment operator, the memory unit is written. If the C programmer wants the address of the memory unit, the expression &number is used.

// C code: number corresponds to a memory unit in the global/static data area.

number = 30; // number is on the left of the assignment operator: its content is overwritten.

a = number; // number is on the right of the assignment operator: its content is read.

At the assembly code level, however, we can regard the symbol number as an address constant. In a demand-paging operating system, the linker eventually assigns each global and static variable a memory unit in the virtual address space, which amounts to replacing the symbol number in the assembly code with an address constant. To access the content of the memory unit, use the movl instruction; to obtain its address, use the leal instruction, as shown below.

// Suppose the address of the global variable number is 0x804a060

movl number, %eax; // register eax now holds 30

leal number, %ebx; // register ebx now holds 0x804a060
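The same distinction between the content of number and its address can be written in C. This is a small illustrative sketch; the function names are assumptions, and the instructions mentioned in the comments are only typical counterparts, since the actual output depends on the compiler and its options.

```c
#include <assert.h>
#include <stdint.h>

static int number;   /* storage whose address is fixed at link/load time */
static int a;

/* number on the left of '=': the memory unit is written. */
static void write_number(void)
{
    number = 30;
}

/* number on the right of '=': the memory unit is read
   (comparable to "movl number, %eax"). */
static int read_number(void)
{
    a = number;
    return a;
}

/* &number: only the address is produced, the content is not accessed
   (comparable to "leal number, %ebx"). */
static uintptr_t address_of_number(void)
{
    return (uintptr_t)&number;
}
```

After write_number(), read_number() returns 30, while address_of_number() returns the same address before and after the write.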

If number is a local variable, its storage is allocated dynamically on the stack, so its symbolic name does not appear in the assembly code at all; it is instead denoted by an operand such as "-4(%ebp)". The register ebp points into the stack at run time. At compile time we can only calculate the offset of the local variable number within the stack frame; its base address is unknown and is supplied by register ebp at run time.

In C, of course, the array name is a special case. By the reasoning above, the symbol arr at the C language level should denote the content of the array. However, the C compiler treats an array name differently depending on context, which makes its semantics inconsistent. This is also the source of much confusion among C programmers, for example about the difference between arr and &arr. For the array arr below, the type of the symbol arr in the symbol table is always the array type int[4], but when arr is used in different contexts, the types of the corresponding expressions differ.

int arr[4];

(1) sizeof(arr): the value is 16; here the expression arr has the array type int[4].

(2) arr + 1: here arr is treated as the address of element 0 of the array; the type of arr[0] is int and the type of &arr[0] is int *, so the type of the expression arr is also int *.

(3) &arr + 1: here the expression arr has the array type int[4], and &arr is a pointer to the array int[4], that is, its type is int (*)[4].
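The three cases can be verified by measuring how far each expression steps in bytes. The sketch below uses hypothetical helper names and assumes nothing about sizeof(int); with a 4-byte int the results are 16, 4, and 16.

```c
#include <assert.h>
#include <stddef.h>

static int arr[4];

/* (1) In sizeof(arr), arr keeps its array type int[4]. */
static size_t size_of_array(void)
{
    return sizeof(arr);
}

/* (2) In arr + 1, arr is converted to int *, so the step is one int. */
static size_t step_of_arr(void)
{
    assert((void *)arr == (void *)&arr);   /* same address, different types */
    return (size_t)((char *)(arr + 1) - (char *)arr);
}

/* (3) &arr has type int (*)[4], so &arr + 1 steps over the whole array. */
static size_t step_of_address_of_arr(void)
{
    return (size_t)((char *)(&arr + 1) - (char *)&arr);
}
```

Note that arr and &arr compare equal as addresses; only their types, and therefore the results of pointer arithmetic on them, differ.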

We can make a bold guess that, for the sake of runtime efficiency, the designers of C chose to treat the array name arr as the address of element 0, arr[0]. For example, in the function call "f(bigArr)" below, if bigArr denoted the content of the array, passing the argument would require transferring 4000 bytes of data, which needs a lot of stack space and takes considerable time to copy. If instead the C compiler treats bigArr in f(bigArr) as the address of bigArr[0], only an address needs to be passed; correspondingly, the parameter int num[1000] of function f is implicitly adjusted by the C compiler to int *num. This does not, however, completely prevent C programmers from passing array contents by value: they can still write the struct Container below, and when an object of type struct Container is passed to function k, the C compiler still copies the array data.

int bigArr[1000];

void f(int num[1000]){

}

void g(void){

f(bigArr);

}

struct Container{

int data[1000];

};

void k(struct Container d){

}
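The difference between passing an address and copying the data can be observed by letting each function write to its parameter. The following sketch fills in bodies for f and k purely for illustration; the bodies and the macro N are assumptions, not part of the original listing.

```c
#include <assert.h>

#define N 1000

static int bigArr[N];

struct Container {
    int data[N];
};

/* The parameter is adjusted to int *num: only an address is passed. */
static void f(int num[N])
{
    assert(num != 0);
    num[0] = 1;       /* writes into the caller's array */
}

/* The whole struct, including its N ints, is copied into d. */
static void k(struct Container d)
{
    d.data[0] = 1;    /* changes only the local copy */
    (void)d;
}
```

After f(bigArr), bigArr[0] is 1 in the caller; after k(c) for some struct Container c, c.data[0] is unchanged, because k worked on a copy.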

 

From the standpoint of semantic consistency, it might actually be a good design for the array name bigArr to denote the content of the array at the C language level; this would arguably fit the principle of "provide mechanism, not policy" better. The C compiler would provide the various mechanisms for passing parameters, and the choice among them would be left to C programmers to make according to the application context, as with the function declarations h1, h2, and h3 below. This may be a matter of design aesthetics. Once a decision has become part of the standard, however, we must abide by it strictly.

void h1(int arr[1000]);

void h2(int *arr);

void h3(int (*ptr)[1000]);
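What each declaration really receives can be made visible by having the functions report the size of what their parameter points to. In this sketch the return type size_t, the function bodies, and the helper same_type are assumptions added for illustration; the original declarations return void.

```c
#include <assert.h>
#include <stddef.h>

static int arr[1000];

/* int a[1000] is adjusted to int *a, so h1 and h2 have the same type. */
static size_t h1(int a[1000]) { return sizeof *a; }   /* size of one int */
static size_t h2(int *a)      { return sizeof *a; }   /* size of one int */

/* A pointer to the whole array keeps the length in its type. */
static size_t h3(int (*ptr)[1000]) { return sizeof *ptr; }

/* A pointer of type size_t (*)(int *) accepts h1 without a cast. */
static int same_type(void)
{
    size_t (*fp)(int *) = h1;
    assert(fp != NULL);
    return fp == h1;
}
```

The calls are h1(arr), h2(arr), and h3(&arr): only the third passes an expression whose type still records that the array has 1000 elements.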

Once functions such as Offset in Figure 5.2.10 are understood, the expression translation in tranexpr.c becomes much easier to follow. In the next section, we will discuss the function CheckMemberAccess, used in tranexpr.c to translate struct member accesses, and the function CheckPostfixExpression, used to translate array element accesses.

