4.2.4 Semantic checking of function calls
This section discusses the semantic checking of function calls. Syntactically, a function call is a postfix expression (postfix-expression). In the UCC compiler, the function CheckFunctionCall() in exprchk.c performs the semantic check of function calls, as shown in Figure 4.2.18. To read this code you need to know what the syntax tree of a function call looks like after parsing; see the syntax tree for postfix operators in Figure 3.1.21, or look ahead to Figure 4.2.19.
Figure 4.2.18 CheckFunctionCall()
To check a function call of the form f(a,b,c), we first look in the symbol table to see whether the function f has been declared; see lines 6 and 7 of Figure 4.2.18. According to the C standard, if f is used without a declaration, it is treated as if it had an old-style function declaration, equivalent to declaring f as int f(). As described in Section 2.4, "The C type system", old-style functions can cause maddening nightmares. This is probably also why C++ compilers prohibit the use of undeclared functions. Here we can see once more that C++ is not a superset of C: C++ stays compatible with existing C as far as it can, and abandons the parts it really cannot tolerate. DefaultFunctionType on line 8 represents the type of an old-style declaration of the form int f(), and lines 9 to 11 add the implicitly declared int f() to the global symbol table through the function AddFunction.
Of course, if f was previously declared as int f(int,int,int), we can find the information about f in the symbol table. Alternatively, when the call is made through a form such as (*ptr)(a,b,c), the expression (*ptr) is not an expression node whose op field is OP_ID. In both cases, the CheckExpression function is called on line 13 to check the expression f or (*ptr). After the Adjust() function adjusts the type on line 15, if the type of the node for f or (*ptr) is not a pointer to function, we report an error on lines 18 to 20; otherwise line 22 records the type of the function.
For the node f in Figure 4.2.19, the type is pointer to function int(int,int,int), that is, int (*)(int,int,int), so the type recorded on line 22 of Figure 4.2.18 is int(int,int,int). Line 56 records the type of the function's return value, so the type of the expression f(30,40,50) is int. For the data structure of function types, see Figure 2.4.9, "Type structure of a function", in Chapter 2.
Lines 24 to 39 of Figure 4.2.18 check each argument of the function call. Most of the work is done by CheckArgument() on line 30. The checks cover whether the number of arguments matches the number of parameters and whether each argument matches its parameter in type, which amounts to checking whether the argument can be assigned to the parameter. Lines 40 to 55 issue a warning or an error when the number of arguments and the number of parameters differ. For an old-style function declaration of the form int f(), the parameter list is not part of the function's interface, so hasProto on lines 42 and 50 is 0 (no prototype). A prototype means "this is the template and we must follow it exactly: the template has this many parameters, so the call must have this many arguments". For old-style functions we follow Clang's practice and only give a warning, as shown on lines 47 and 53. For a new-style declaration of the form int f(int,int,int), the call must follow the prototype, otherwise we report an error, as shown on lines 44 and 51.
Lines 36 to 39 of Figure 4.2.18 handle extra arguments, as in the call f(30,40,50,60,70), where the new-style declaration int f(int,int,int) declares only 3 parameters. When CheckArgument() on line 30 handles the arguments of a new-style function and finds that 3 arguments have already been checked, it sets the variable argFull of line 27 to 1, and the while loop on line 29 stops. Because semantic checking may change or even rebuild syntax tree nodes, we record the return value of CheckArgument() on line 30. For the extra arguments we simply call CheckExpression() on line 37; by this point we have already run into the error "the number of arguments does not match the number of parameters".
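The warning-versus-error decision described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code of Figure 4.2.18: the names FuncSig, isVariadic, check_arg_count, and the Diag codes are hypothetical, and the variadic rule comes from standard C rather than from the text.

```c
#include <assert.h>

/* Hypothetical result codes; UCC reports through its own diagnostics. */
enum Diag { DIAG_NONE, DIAG_WARNING, DIAG_ERROR };

/* Simplified stand-in for the part of a function type used here. */
struct FuncSig {
    int hasProto;    /* 0 for "int f()", 1 for "int f(int,int,int)" */
    int isVariadic;  /* assumption: 1 for prototypes ending in "..." */
    int paramCount;  /* number of declared (named) parameters */
};

enum Diag check_arg_count(const struct FuncSig *sig, int argCount)
{
    if (argCount == sig->paramCount)
        return DIAG_NONE;
    if (!sig->hasProto)
        return DIAG_WARNING;          /* old style: mismatch only warns */
    if (sig->isVariadic && argCount > sig->paramCount)
        return DIAG_NONE;             /* extra arguments feed the "..." */
    return DIAG_ERROR;                /* new style: prototype is binding */
}
```

With this model, calling an old-style int f() with three arguments yields a warning, while calling int f(int,int,int) with five arguments yields an error.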
Figure 4.2.19 Syntax tree for function calls
Next, we analyze the CheckArgument function called on line 30 of Figure 4.2.18, as shown in Figure 4.2.20. Line 5 gets the number of parameters of a new-style function, and line 6 recursively calls CheckExpression to check the argument expression. If the new-style declaration has the form f(void), that is, it takes no parameters, line 8 sets the corresponding flag to 1 to indicate that all arguments of the new-style function have been checked, and line 9 returns directly. For a declaration of the form f(int,int,int), f is not a variadic function; if the argument being checked is the last one, the condition on line 11 holds and line 12 sets the flag to 1. For old-style functions such as int f(), the parameter list is not part of the function interface, and every argument of the call needs argument promotion, which is done by PromoteArgument() on line 15. For new-style functions we need to check that the argument can be assigned to the corresponding named parameter; CanAssign() on line 20 makes this decision. For memory alignment reasons, a C compiler typically converts an argument smaller than int (for example char or short) to int, and line 23 performs this conversion implicitly. A float argument of a new-style function, however, is not promoted to double, which differs from PromoteArgument(). Because CanAssign() on line 20 has already accepted the pair of types, line 25 can safely perform the conversion needed to assign the argument to the parameter. Line 28 handles variadic new-style functions: arguments that correspond to the unnamed parameters undergo argument promotion.
When reading the code of CheckArgument(), if the type structure of functions is unclear, refer to Figure 2.4.9, "Type structure of a function".
Figure 4.2.20 CheckArgument()
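The argument promotion applied to unnamed parameters can be observed from ordinary C code. The sketch below is standard C, not UCC source, and the function name add_char_and_float is made up for the illustration: a char and a float passed through "..." arrive as int and double, so those are the types va_arg has to name.

```c
#include <assert.h>
#include <stdarg.h>

/* Expects exactly two variadic arguments: a char and a float.
 * After the default argument promotions they are an int and a double. */
double add_char_and_float(int unused, ...)
{
    va_list ap;
    int c;
    double f;

    va_start(ap, unused);
    c = va_arg(ap, int);      /* char  was promoted to int    */
    f = va_arg(ap, double);   /* float was promoted to double */
    va_end(ap);
    return c + f;
}
```

A call such as add_char_and_float(0, ch, fl) with char ch and float fl therefore works, while naming char or float in va_arg would be undefined behavior.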
The CanAssign() function used on line 20 of Figure 4.2.20 is written almost literally from section "3.3.16.1 Simple assignment" of the C standard document Ansi.c.txt, which specifies when an assignment is allowed. The relevant code is shown in Figure 4.2.21. We deliberately kept one semantic rule from Ansi.c.txt on lines 9 to 12, and the if statement on line 13 implements it: if the left and right operands of the assignment both have arithmetic type, the assignment is allowed. The cast that may be needed when the argument is assigned to the parameter is done on lines 22 to 25 of Figure 4.2.20.
Figure 4.2.21 CanAssign()
Depending on the types of the left and right operands, an assignment is allowed in the following cases, which are tested in this order.
(1) The two types are identical, corresponding to line 6 of Figure 4.2.21.
(2) Both are arithmetic types, corresponding to line 13.
(3) Both are compatible pointer types, such as T1 * and T2 *, where T1 and T2 are compatible types and carry the same qualifiers (that is, the same const or volatile). The function IsCompatibleType() decides whether types are compatible; we analyze it in later chapters. This case is tested on line 16.
(4) Both are pointer types, such as T1 * and T2 *, where T1 and T2 agree on qualifiers and one of them is void, while the other is not a function type (that is, it is an object type describing data, such as a struct or double), corresponding to line 19.
(5) The left operand has pointer type and the right operand is the constant 0, corresponding to line 23.
(6) Both are pointer types, corresponding to line 26; in this case a warning is given on line 27.
(7) One is a pointer type and the other is an integer type, and both occupy the same amount of memory, corresponding to line 30; in this case a warning is given on line 32.
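The ordering of cases (1) to (7) can be mirrored in a toy model. This is a sketch under heavy assumptions, not the code of Figure 4.2.21: struct Ty, the K_ categories, the Verdict codes, and the crude compatible() helper (a stand-in for IsCompatibleType()) are all hypothetical, and qualifier agreement is tested as plain equality as the list describes.

```c
#include <assert.h>
#include <stddef.h>

enum Categ { K_INT, K_DOUBLE, K_VOID, K_STRUCT, K_FUNCTION, K_POINTER };
enum Verdict { ASSIGN_OK, ASSIGN_WARN, ASSIGN_ERROR };

struct Ty {
    enum Categ categ;
    int qual;               /* bit 0 = const, bit 1 = volatile */
    int size;               /* size in bytes */
    const struct Ty *bty;   /* pointee type when categ == K_POINTER */
};

static int is_arith(const struct Ty *t)
{
    return t->categ == K_INT || t->categ == K_DOUBLE;
}

/* Crude stand-in for IsCompatibleType(): distinct struct or function
 * objects are treated as distinct types. */
static int compatible(const struct Ty *a, const struct Ty *b)
{
    if (a == b) return 1;
    if (a->categ != b->categ) return 0;
    if (a->categ == K_POINTER) return compatible(a->bty, b->bty);
    return a->categ != K_STRUCT && a->categ != K_FUNCTION;
}

enum Verdict can_assign(const struct Ty *l, const struct Ty *r, int rhsIsZero)
{
    int bothPtr = l->categ == K_POINTER && r->categ == K_POINTER;

    if (l == r) return ASSIGN_OK;                               /* (1) */
    if (is_arith(l) && is_arith(r)) return ASSIGN_OK;           /* (2) */
    if (bothPtr && l->bty->qual == r->bty->qual) {
        if (compatible(l->bty, r->bty)) return ASSIGN_OK;       /* (3) */
        if ((l->bty->categ == K_VOID && r->bty->categ != K_FUNCTION) ||
            (r->bty->categ == K_VOID && l->bty->categ != K_FUNCTION))
            return ASSIGN_OK;                                   /* (4) */
    }
    if (l->categ == K_POINTER && rhsIsZero) return ASSIGN_OK;   /* (5) */
    if (bothPtr) return ASSIGN_WARN;                            /* (6) */
    if (((l->categ == K_POINTER && r->categ == K_INT) ||
         (l->categ == K_INT && r->categ == K_POINTER)) &&
        l->size == r->size)
        return ASSIGN_WARN;                                     /* (7) */
    return ASSIGN_ERROR;
}
```

Because the tests run in order, a pointer pair that fails (3) and (4) still reaches (6) and only draws a warning, while two different struct types fall through every case and are rejected.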
From this we can see that struct objects of different types cannot be assigned to each other.
Next, let's analyze the Cast() function used on line 23 of Figure 4.2.20; the relevant code is shown in Figure 4.2.22. The code that actually builds the OP_CAST node of a cast is in CastExpression() on line 31: line 37 creates a syntax tree node, line 39 sets its op field to OP_CAST, line 40 records the type after conversion, and line 41 records the expression before conversion. Of course, for a cast of a constant such as the 3 on line 34, the conversion can be simplified at compile time and we directly use 3.0f. This is done by the function FoldCast() in fold.c, which is easy to follow, so we do not go into it. The comments on lines 48 to 53 of Figure 4.2.22 list the data type codes used inside the UCC compiler. I4 on line 2 stands for a signed integer of 4 bytes, corresponding to int or long on a 32-bit system; U4 stands for an unsigned integer of 4 bytes; F4 stands for a 4-byte floating-point number, corresponding to float; F8 stands for an 8-byte floating-point number; V stands for void; and B stands for a block object, corresponding to a union, array, or struct object. The array optypes[] on line 55 records this mapping, so by calling the function TypeCode() on line 44 we can quickly get the code, such as I4, that corresponds to a type structure. The order of the type codes in the comment on line 2 is deliberate: it lets us make type decisions quickly. For example, the if statement on lines 8 to 9 tests whether both types are integers that occupy the same amount of memory; short and unsigned short satisfy this condition.
Figure 4.2.22 Cast()
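To see why the ordering of the type codes helps, consider the following sketch. It is an assumption-laden illustration rather than the code of Figure 4.2.22: the text only names I4, U4, F4, F8, V, and B, so the codes I1, U1, I2, U2 and both helper functions are hypothetical.

```c
#include <assert.h>

/* Assumed ordering: each signed/unsigned pair of equal size is adjacent,
 * integers come before floating-point codes, V and B come last. */
enum TypeCode { I1, U1, I2, U2, I4, U4, F4, F8, V, B };

/* Both codes are integers of the same size exactly when they fall into
 * the same pair, i.e. integer division by 2 gives the same value. */
int same_size_integers(enum TypeCode src, enum TypeCode dst)
{
    return src < F4 && dst < F4 && src / 2 == dst / 2;
}

/* Operands narrower than int are the ones that go through int first. */
int narrower_than_int(enum TypeCode code)
{
    return code < I4;
}
```

With this layout, a range comparison and one division replace a chain of per-type comparisons.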
For integer types of the same size, such as char and unsigned char or short and unsigned short, a conversion between the signed and the unsigned type (for example between signed char and unsigned char) does not change the data stored in memory. We only need to record that the type of the storage has changed. In this case it is enough to execute line 17 of Figure 4.2.22, which writes the new type onto the corresponding syntax tree node; this new type affects instruction selection in the code generation phase. For example, for a right shift of a signed integer we fill the highest bit with the sign bit and select the instruction SAR, while for a right shift of an unsigned integer we fill the highest bit with 0 and select the instruction SHR. In the UCC compiler, even when a char variable c1 is cast to float, that is (float)c1, we take two steps: first the char is promoted to int, then the int is converted to float. In the early days, the size of int in C reflected the size of the CPU's general-purpose data registers, and the CPU naturally expects an operand to fit exactly into a data register. Lines 20 to 23 of Figure 4.2.22 perform the first step from char to int, and line 28 performs the second step from int to float. Conversely, casting a float variable f to char also takes two steps, first from float to int and then from int to char; lines 25 to 28 do this work. Lines 5 to 7 handle a cast to void. This is typically used when a parameter arg is not used in the function body and the compiler would warn about it; one way to avoid the warning is to add the statement (void)arg;.
    void f(int arg) {
        (void)arg;
        ...
    }
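The two-step conversions with int as the relay can be written out by hand in plain C. This is only an illustration of the idea, not UCC code; the function names are made up, and the float value is assumed to be within the range of int, since the conversion is undefined otherwise.

```c
#include <assert.h>

/* (float)c1 carried out as char -> int -> float. */
float char_to_float(char c1)
{
    int relay = (int)c1;      /* step 1: promote char to int   */
    return (float)relay;      /* step 2: convert int to float  */
}

/* (char)f carried out as float -> int -> char. */
char float_to_char(float f)
{
    int relay = (int)f;       /* step 1: truncates toward zero */
    return (char)relay;       /* step 2: narrow int to char    */
}
```

For example, float_to_char(65.9f) first yields the int 65 and then the char with value 65.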
The code on lines 11 to 16 of Figure 4.2.22 is rather subtle, so let's explain it with the concrete example in Figure 4.2.23. In the UCC compiler, a char or short operand of an arithmetic operation is first promoted to int and only then takes part in the operation. Take the right shift on line 8 of Figure 4.2.23: the short operand s is first converted to int, giving the syntax tree (cast int s) on lines 19 to 20, and the node of type int is then converted to unsigned int as required by the C statement on line 8. Since int and unsigned int both occupy 4 bytes, without lines 11 to 16 of Figure 4.2.22 we would not call CastExpression to build a syntax tree node for an OP_CAST operation. We would only set the type of the corresponding node to unsigned int, and the (int) cast on line 8 of Figure 4.2.23 would then overwrite the type of the node for s >> 1 with int. In the assembly code generation of UCC, for example in the function EmitAssign(IRInst inst) in ucl\x86.c, we choose between the SAR and SHR instructions according to the type of the intermediate code inst, and that type comes from the type of the syntax tree node s >> 1; we will see this when we discuss the intermediate code generation function TranslateBinaryExpression(). As a result we would wrongly treat the s >> 1 node as the signed type int and use the arithmetic right shift SAR in code generation, which gives an incorrect result. According to the C statement on line 8, we should use the logical right shift instruction SHR and fill the highest bit with 0. For this reason, when converting between I4 and U4, we call CastExpression() on line 15 of Figure 4.2.22 to explicitly build a syntax tree node for the cast, as shown on line 18 of Figure 4.2.23.
Figure 4.2.23 The corresponding syntax tree for transformation
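The difference between the two shift instructions matters only when the sign bit is set. The sketch below models both on a 32-bit pattern in portable C; it is not UCC code, and the names shr32 and sar32 are chosen here for the illustration. The shift count is assumed to be less than 32.

```c
#include <assert.h>
#include <stdint.h>

/* Logical right shift, as SHR does: zeros enter from the left. */
uint32_t shr32(uint32_t bits, unsigned n)
{
    return bits >> n;
}

/* Arithmetic right shift, as SAR does, computed on the bit pattern:
 * when the sign bit is set, ones enter from the left instead. */
uint32_t sar32(uint32_t bits, unsigned n)
{
    if (bits & 0x80000000u) {
        uint32_t inverted = (uint32_t)~bits;
        return (uint32_t)~(inverted >> n);
    }
    return bits >> n;
}
```

For the bit pattern of -8 (0xFFFFFFF8), a shift by 1 gives 0x7FFFFFFC with shr32 but 0xFFFFFFFC (the pattern of -4) with sar32, which is why picking the wrong instruction from the node type produces a wrong result.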
To summarize: in the UCC compiler, operands smaller than int that take part in arithmetic operations (char, unsigned char, short, or unsigned short) are promoted to int, as with the short variable s on line 8. Note that even when two char variables are added, as in c1 + c2, we first promote c1 and c2 to int and then perform a 32-bit addition. For explicit casts, such as (float)c1 on line 9 and (char)f on line 10 of Figure 4.2.23, we also use int as a relay; see the syntax trees on lines 22 to 30 of Figure 4.2.23. Using int as the operand type really means that we always try to make full use of the CPU's general-purpose data registers. Once Cast() is understood, the argument promotion function PromoteArgument mentioned earlier is easy to follow.
    static AstExpression PromoteArgument(AstExpression arg) {
        Type ty = Promote(arg->ty);
        return Cast(ty, arg);
    }
    Type Promote(Type ty) {
        return ty->categ < INT ? T(INT) : (ty->categ == FLOAT ? T(DOUBLE) : ty);
    }
As for the function IsCompatibleType() mentioned earlier, which determines whether two types are compatible, we discuss it in a later section. To understand it, we need an intuitive feel for the various type structures introduced in Section 2.4, "The C type system". To repeat an earlier piece of advice: it may look clumsy, but a paper notebook is very effective. Draw the type structures from Section 2.4 and the syntax trees built in Chapter 3 on paper and read the code against these drawings, so that you do not get lost in the huge syntax trees and the complex type structures.
Two other rather subtle functions from the preceding text are LookupFunctionID on line 7 and AddFunction on line 10 of Figure 4.2.18. Let's discuss them using the example in Figure 4.2.24. In the body of function f, line 3 of Figure 4.2.24 declares the function h, which prevents us from declaring a variable h of type int on line 5. The declaration of h inside the body of f occupies not only an entry in the local symbol table (so that when the local variable h on line 5 is about to be entered into the local symbol table, the semantic check of declarations can report an error), but also an entry in the global symbol table (so that the call h(3,4) on line 8 succeeds). The declaration of the function h on line 3 is therefore quite different from the declaration of the local variable a on line 4: a cannot be accessed outside the body of f.
Figure 4.2.24 Function declaration in the function body
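For readers without the figure at hand, the following standard C fragment shows the same kind of situation. It is an illustrative reconstruction under assumptions, not the content of Figure 4.2.24, and its line numbers do not correspond to those cited in the text.

```c
#include <assert.h>

int f(void)
{
    int h(int, int);   /* block-scope declaration of the external function h */
    int a = 1;         /* ordinary local variable, invisible outside f */
    /* int h;             uncommenting this redeclares h in the same scope,
                          which a conforming compiler rejects */
    return h(3, 4) + a;
}

/* The function declared inside f is an ordinary external function. */
int h(int x, int y)
{
    return x + y;
}
```

The block-scope declaration refers to the file-scope function h, which is the reason the real type information is best kept in the global symbol table.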
Also, the C compiler treats the following two function declarations as compatible; the function IsCompatibleType() checks whether two types are compatible.
    int g(int (*)(), double (*)[3]);
    int g(int (*)(char *), double (*)[]);
In this case the C compiler does not report an error. Instead, from the two compatible declared types it constructs a type that plays a role similar to a "greatest common divisor"; the C standard calls it the composite type. For the two declarations above, the resulting composite type is shown below. The function CompositeType(ty1, ty2) in type.c of the UCC compiler implements this operation. We will analyze type-system functions such as IsCompatibleType and CompositeType further in later chapters.
    int g(int (*)(char *), double (*)[3]);
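A reduced version of this situation can be compiled directly. The sketch below keeps only the array parameter, because the meaning of an empty parameter list such as int (*)() changed in recent C standards; it is an illustration in standard C, not UCC code, and the body of g is invented for the example.

```c
#include <assert.h>

int g(double (*)[3]);   /* first declaration: array length known */
int g(double (*)[]);    /* second declaration: length left open  */

/* The two declarations are compatible, so g has the composite type
 * int g(double (*)[3]); the definition below agrees with it. */
int g(double (*row)[3])
{
    return (int)(sizeof *row / sizeof (*row)[0]);
}
```

Both declarations are accepted without a diagnostic, and g applied to the address of a double[3] array returns 3.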
Because of these subtle semantics, we want the real type information of a function declaration to be stored only in the global symbol table. Then, when CompositeType() changes the type information of an existing function declaration g, we only need to change the corresponding symbol in the global symbol table and not any other symbol table. In the local symbol table we place only a placeholder, so that we can detect a redefinition error like the one in the comment on line 5 of Figure 4.2.24. The AddFunction function in the UCC compiler adds functions to the global symbol table. The original ucc162 version of UCC had no LookupFunctionID function, only the LookupID function. To implement the function declaration semantics above while changing the original code as little as possible, we added LookupFunctionID in version ucc162.2. Admittedly, and somewhat embarrassingly, that code is rather obscure and the patch is quite ugly; it should be improved in a later version. The original LookupID function searches the current symbol table and, if the name is not found, continues with the enclosing symbol tables up to the global symbol table, as shown in Figure 4.2.25. The actual lookup is done by the DoLookupSymbol function on line 1: line 3 computes the hash value, and the for loop on lines 6 to 9 searches the hash bucket selected by that value. If there is an enclosing symbol table and we want to search it, the while condition on line 11 is true.
Figure 4.2.25 LookupID() function
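The lookup strategy just described (hash the name, scan one bucket, then move outward) can be sketched as follows. This is a simplified model under assumptions, not the code of the figure: struct Sym, struct Table, the hash function, and the use of strcmp for name comparison are all choices made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUCKETS 8

struct Sym {
    const char *name;
    struct Sym *link;      /* next symbol in the same hash bucket */
};

struct Table {
    struct Sym *buckets[BUCKETS];
    struct Table *outer;   /* enclosing scope, NULL for the global table */
};

static unsigned hash_name(const char *name)
{
    unsigned h = 0;
    while (*name)
        h = h * 31u + (unsigned char)*name++;
    return h % BUCKETS;
}

/* Insert at the head of the bucket chain of the given table. */
void add_sym(struct Table *tbl, struct Sym *sym)
{
    unsigned h = hash_name(sym->name);
    sym->link = tbl->buckets[h];
    tbl->buckets[h] = sym;
}

/* Search the given table first, then each enclosing table in turn. */
struct Sym *lookup(struct Table *tbl, const char *name)
{
    unsigned h = hash_name(name);
    do {
        struct Sym *p;
        for (p = tbl->buckets[h]; p != NULL; p = p->link)
            if (strcmp(p->name, name) == 0)
                return p;
        tbl = tbl->outer;
    } while (tbl != NULL);
    return NULL;
}
```

A name defined only in the global table is still found when the search starts from a local table, and a local definition of the same name hides the global one.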
Inside the UCC compiler there are two main data structures for storing symbols, hash tables and singly linked lists. Several variables defined in ucl\symbol.c serve this purpose, as shown in Figure 4.2.26. The hash table GlobalTags on line 3 stores the names of structs, unions, and enums declared outside function bodies; the C standard calls these names tags. In Chinese, tag is often translated with the same word that is used for label (such as again in the statement goto again;), so the two are easy to confuse. GlobalIDs on line 5 holds global variable names and function names, and constants (for example 123) are stored in the hash table Constants on line 7. From this we can see that even when declared in the same scope, a struct name such as struct Data and a variable name such as int abc are stored in different hash tables. The pointer Tags on line 9 points to the symbol table that holds the struct names of the current scope, and the pointer Identifiers on line 11 points to the symbol table that holds the variable names and function names of the current scope. To make code generation easier in the next phase, we also link the symbols of function names into the list headed by Functions on line 16, while global and static variables are on the list headed by the pointer Globals on line 18. Strings on line 21 and FloatConstants on line 23 are the list heads for strings and floating-point constants respectively. Pointers such as FunctionTail on line 14 always point to the tail of the corresponding symbol list, which makes insertion easy.
Figure 4.2.26 Data structures related to symbols
For a type name introduced with the typedef keyword, such as typedef int Data; in the code below, the UCC compiler stores Data in the same symbol table as the variable name a, that is, the current symbol table pointed to by Identifiers on line 11 of Figure 4.2.26. The struct name Data in struct Data is stored in the symbol table pointed to by Tags on line 9 of Figure 4.2.26. In C, a struct name must be used together with the struct keyword, so in the following code the local variable a is unambiguously declared with type Data, that is, int, and not struct Data. In C++, struct and class names can be used without the struct or class keyword, which makes the Data in Data a = 3; ambiguous, so a C++ compiler reports an error for the following code.
    void f(void) {
        struct Data {
            int a;
        };
        typedef int Data;
        Data a = 3;
    }
The function AddFunction() is shown in Figure 4.2.27. Lines 4 to 11 of Figure 4.2.27 create a symbol of kind SK_Function, and the UCC compiler adds the symbol for the function name to the global symbol table GlobalIDs through the AddSymbol function on line 15. UCC also uses a singly linked list to record the symbols of all function names; its head is the variable Functions on line 16 of Figure 4.2.26. Because the same symbol object may be on a singly linked list and in a hash table at the same time, the struct functionSymbol object behind FunctionSymbol on line 4 uses its next field to form the singly linked list and its link field to form the list within a hash bucket.
Figure 4.2.27 AddFunction()
When the current symbol table is searched with the LookupFunctionID() function on line 19 of Figure 4.2.27, the parameter placeholder determines whether a placeholder needs to be added to the current symbol table. Thanks to the placeholder, the UCC compiler detects the redefinition error "error:redefinition of h" when the comment marker // on line 5 of Figure 4.2.24 is removed. The current symbol table may be the global symbol table or a local one; for the nested structure of symbol tables, see Figure 2.5.12, "Symbol tables for multiple scopes", in Chapter 2. Syntactically, the pair of braces of a C compound statement opens a new scope, which corresponds to a new symbol table. Lines 26 to 31 of Figure 4.2.27 add a function symbol of type DefaultFunctionType to the local symbol table. Since the real type information of a function declaration is stored in the global symbol table, lines 33 and 38 search the global symbol table GlobalIDs. The symbol returned by LookupFunctionID does not necessarily have function type; it may also be an ordinary variable, and the caller decides what to do with the return value based on the calling context. Looking up where LookupFunctionID is used, for example with Source Insight, makes this context clearer, so we do not repeat it here. It bears saying once more that understanding a piece of code sometimes requires looking at the context in which it is called. LookupFunctionID was pasted onto the LookupID function as a patch in the ucc162.2.tar.gz version. It barely does the job, is not pretty, and is somewhat obscure. Part of the reason is that a C source file may contain several mutually compatible declarations of the same function, and we want a single struct symbol object to record the type information of one function name.
With the existing code of the UCC compiler, such a struct symbol object is stored in only one hash table, because struct symbol has only one link field for forming the list within a hash bucket. If we wanted a struct symbol object to be in more than one hash table at a time, we could stop using the link field defined in struct symbol. Instead, when a symbol is inserted into a hash bucket through AddSymbol(), we would create a struct bucketLinker object like the one shown below, whose sym field points to the struct symbol object being inserted and whose linker field forms the chain within the hash bucket. This could be a follow-up improvement in a later version.
    struct bucketLinker {
        struct bucketLinker *linker;
        struct symbol *sym;
    };
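How such a linker node would let one symbol appear in two tables can be sketched briefly. This is a hypothetical illustration of the proposed design, not existing UCC code: the reduced struct symbol and the helper bucket_insert are invented for the example, and each bucket is represented by a single chain head.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for UCC's struct symbol: no link field is needed. */
struct symbol {
    const char *name;
};

struct bucketLinker {
    struct bucketLinker *linker;  /* next entry in the same hash bucket */
    struct symbol *sym;           /* the shared symbol object */
};

/* Push a caller-provided linker node onto the front of a bucket chain. */
void bucket_insert(struct bucketLinker **bucket,
                   struct bucketLinker *node,
                   struct symbol *sym)
{
    node->sym = sym;
    node->linker = *bucket;
    *bucket = node;
}
```

Inserting the same symbol into a local bucket and a global bucket then costs two small linker nodes, while the type information stays in a single struct symbol object.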