An Overview of Undefined Behavior in C
Christopher Cole: A Glimpse of Undefined Behavior in C
A few weeks ago, one of my colleagues came to my desk with a programming question. Lately we have been quizzing each other on our knowledge of C, so I smiled and braced myself for the hell to come.
He wrote a few lines of code on the whiteboard and asked what the program would print.
#include <stdio.h>

int main(){
    int i = 0;
    int a[] = {10,20,30};
    int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
    printf("%d\n", r);
    return 0;
}
It looked simple and clear enough. I explained the operator precedence: postfix operators bind more tightly than multiplication, multiplication binds more tightly than addition, and both multiplication and addition associate from left to right. So I picked out the operands and began writing out the arithmetic.
int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
// = a[0] + 2 * a[1] + 3 * a[2];
// = 10 + 40 + 90;
// = 140
After I proudly wrote down the answer, my colleague responded with a simple "no". I thought for a few minutes and was still stuck. I couldn't quite remember the associativity of the postfix operators. Besides, I knew that associativity would not even change the order of evaluation here, because associativity rules only apply among operators of the same precedence level. Still, I thought I should try evaluating the expression on the assumption that the postfix operators are all evaluated from right to left. It looked simple enough.
int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
// = a[2] + 2 * a[1] + 3 * a[0];
// = 30 + 40 + 30;
// = 100
My colleague replied that this answer was wrong too. At this point I had to admit defeat and asked him what the answer was. It turned out that this short sample had been extracted from a larger piece of code he had written. To check his question, he had compiled and run the larger code and was surprised to find that it did not behave as expected. He stripped out the unnecessary steps until he arrived at the sample above. He compiled the sample with gcc 4.7.3, and it printed a surprising result: "60".
At this point I was intrigued. I remembered that in C the order in which function arguments are evaluated is unspecified, so we guessed that the postfix operators were simply being evaluated in some arbitrary order rather than from left to right. We were still confident that postfix increment has higher precedence than multiplication, so we quickly convinced ourselves that there is no order in which the three i++ operations could be evaluated such that the array elements, multiplied and added up, would give 60.
Now I was hooked. My first thought was to look at the disassembly and try to work out what was actually happening. I compiled the sample with debugging symbols, ran objdump on it, and soon had the annotated x86_64 disassembly.
Disassembly of section .text:

0000000000000000 <main>:
#include <stdio.h>

int main(){
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 83 ec 20             sub    $0x20,%rsp
    int i = 0;
   8:   c7 45 e8 00 00 00 00    movl   $0x0,-0x18(%rbp)
    int a[] = {10,20,30};
   f:   c7 45 f0 0a 00 00 00    movl   $0xa,-0x10(%rbp)
  16:   c7 45 f4 14 00 00 00    movl   $0x14,-0xc(%rbp)
  1d:   c7 45 f8 1e 00 00 00    movl   $0x1e,-0x8(%rbp)
    int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
  24:   8b 45 e8                mov    -0x18(%rbp),%eax
  27:   48 98                   cltq
  29:   8b 54 85 f0             mov    -0x10(%rbp,%rax,4),%edx
  2d:   8b 45 e8                mov    -0x18(%rbp),%eax
  30:   48 98                   cltq
  32:   8b 44 85 f0             mov    -0x10(%rbp,%rax,4),%eax
  36:   01 c0                   add    %eax,%eax
  38:   8d 0c 02                lea    (%rdx,%rax,1),%ecx
  3b:   8b 45 e8                mov    -0x18(%rbp),%eax
  3e:   48 98                   cltq
  40:   8b 54 85 f0             mov    -0x10(%rbp,%rax,4),%edx
  44:   89 d0                   mov    %edx,%eax
  46:   01 c0                   add    %eax,%eax
  48:   01 d0                   add    %edx,%eax
  4a:   01 c8                   add    %ecx,%eax
  4c:   89 45 ec                mov    %eax,-0x14(%rbp)
  4f:   83 45 e8 01             addl   $0x1,-0x18(%rbp)
  53:   83 45 e8 01             addl   $0x1,-0x18(%rbp)
  57:   83 45 e8 01             addl   $0x1,-0x18(%rbp)
    printf("%d\n", r);
  5b:   8b 45 ec                mov    -0x14(%rbp),%eax
  5e:   89 c6                   mov    %eax,%esi
  60:   bf 00 00 00 00          mov    $0x0,%edi
  65:   b8 00 00 00 00          mov    $0x0,%eax
  6a:   e8 00 00 00 00          callq  6f <main+0x6f>
    return 0;
  6f:   b8 00 00 00 00          mov    $0x0,%eax
}
  74:   c9                      leaveq
  75:   c3                      retq
The first and last few instructions only set up the stack frame, initialize the variables, call printf, and return from main. So we only need to care about the instructions from 0x24 to 0x57. This is where the interesting behavior happens. Let's go through them a few instructions at a time.
  24:   8b 45 e8                mov    -0x18(%rbp),%eax
  27:   48 98                   cltq
  29:   8b 54 85 f0             mov    -0x10(%rbp,%rax,4),%edx
The first three instructions are as expected. They load the value of i (0) into the eax register, sign-extend it to 64 bits, and then load a[0] into the edx register. The multiplication by 1 (1 *) has obviously been optimized away by the compiler, but everything looks normal. The next few instructions start out much the same.
  2d:   8b 45 e8                mov    -0x18(%rbp),%eax
  30:   48 98                   cltq
  32:   8b 44 85 f0             mov    -0x10(%rbp,%rax,4),%eax
  36:   01 c0                   add    %eax,%eax
  38:   8d 0c 02                lea    (%rdx,%rax,1),%ecx
The first mov loads the value of i (still 0) into eax, the cltq sign-extends it to 64 bits, and the next mov loads a[0] into eax. This is where it gets interesting. We expected i++ to have taken effect before these three instructions ran, but perhaps the last two instructions would use some compiler magic to produce the expected result (2 * a[1]). Those two instructions add eax to itself, which actually computes 2 * a[0], then add that to the previous result and store the sum in ecx. At this point the code has computed a[0] + 2 * a[0]. It looks a bit strange, but again, maybe some compiler magic is going on.
  3b:   8b 45 e8                mov    -0x18(%rbp),%eax
  3e:   48 98                   cltq
  40:   8b 54 85 f0             mov    -0x10(%rbp,%rax,4),%edx
  44:   89 d0                   mov    %edx,%eax
The next instructions look quite familiar by now. They load the value of i (still 0), sign-extend it to 64 bits, load a[0] into edx, and then copy edx into eax. Well, let's look at some more:
  46:   01 c0                   add    %eax,%eax
  48:   01 d0                   add    %edx,%eax
  4a:   01 c8                   add    %ecx,%eax
  4c:   89 45 ec                mov    %eax,-0x14(%rbp)
Here a[0] is added to itself until eax holds 3 * a[0], the previous result is added on top, and the total is stored in the variable r. Incredibly, our variable r now contains a[0] + 2 * a[0] + 3 * a[0]. Sure enough, that is exactly what the program printed: "60". But what happened to the postfix increments? They are all at the end:
  4f:   83 45 e8 01             addl   $0x1,-0x18(%rbp)
  53:   83 45 e8 01             addl   $0x1,-0x18(%rbp)
  57:   83 45 e8 01             addl   $0x1,-0x18(%rbp)
It looks as if our compiled code is simply wrong! Why were the postfix increments pushed to the bottom, after all the other work had been done? With my faith in reality fading, I decided to go straight to the source. No, not the compiler's source code, which is only an implementation. I grabbed the C11 language specification.
The problem lies in the details of the postfix operators. In our case, we applied postfix increment to the array index three times within a single expression. When a postfix increment is evaluated, it yields the original value of the variable. Storing the new value back into the variable is a side effect. It turns out that side effects are only guaranteed to have completed by the next sequence point. See section 5.1.2.3 of the standard, which defines sequence points in detail. Our expression modifies i three times with no sequence point in between, so it exhibits undefined behavior. It is entirely up to the compiler when the side effect of storing the new value happens relative to the evaluation of the rest of the expression.
In the end, we both learned a little more about C. It is well known that best practice is to avoid building complex expressions out of prefix and postfix operators, and this is an excellent example of why.