Preface
In the previous article, we saw that reducing the number of operations in a computation does not necessarily improve its performance. Now let's find out why.
Processor Architecture
In a computer's processor, executing an instruction involves many operations, which can be divided into stages such as fetch, decode, execute, memory, and write back (the program counter is updated as well). These stages can be performed simultaneously in a pipeline, as shown in the figure (omitted here), where F, D, E, M, and W denote the five stages above. Of course, modern processors are much more complicated than this example, but the principles are the same.
- Double-precision floating-point multiplication: latency 5, issue time 1
- Double-precision floating-point addition: latency 3, issue time 1
- Single-precision floating-point multiplication: latency 4, issue time 1
- Single-precision floating-point addition: latency 3, issue time 1
- Integer multiplication: latency 3, issue time 1
- Integer addition: latency 1, issue time 0.33
The list above shows the performance of some arithmetic operations on an Intel Core i7; the numbers are representative of other processors as well. Each operation is characterized by two cycle counts:
- Latency indicates the total time required to complete the operation.
- Issue time indicates the minimum number of clock cycles required between two consecutive operations of the same type.
We can see that for most of these arithmetic operations the issue time is 1, which means the processor can start a new operation of that type every clock cycle. This short issue time is achieved through pipelining: a pipelined functional unit is implemented as a series of stages, each of which completes part of the operation. For example, a typical floating-point adder contains three stages (hence the three-cycle latency):
- Process the exponent values
- Add the fractions
- Round the result
Arithmetic operations can pass through the stages in close succession, without waiting for one operation to complete before starting the next. This capability can be exploited only when the operations to be executed are consecutive and logically independent. A functional unit with an issue time of 1 is called fully pipelined: it can start a new operation every clock cycle. The issue time of 0.33 for integer addition arises because the hardware contains three fully pipelined functional units capable of integer addition, so the processor can execute three additions per clock cycle.
The material above comes from Computer Systems: A Programmer's Perspective (2nd Edition). For more details, see the book, especially Chapter 4, "Processor Architecture", and Chapter 5, "Optimizing Program Performance". The two algorithms discussed in this article come from the book's Practice Problems 5.5 and 5.6.
Analysis of the poly Function
Below is the C-language source code of the poly function:

```c
double poly(double a[], double x) {
    double result = 0, p = 1;
    /* n is a constant defined elsewhere in the test program */
    for (int i = 0; i < n; i++, p *= x)
        result += a[i] * p;
    return result;
}
```
On the openSUSE 12.1 operating system, we disassemble the test program from the previous article with the command `objdump -d a.out`. The assembly code of the poly function is as follows:
```
0000000000400640 <poly>:
  400640: 66 0f 57 d2             xorpd  %xmm2,%xmm2
  400644: 31 c0                   xor    %eax,%eax
  400646: f2 0f 10 0d 92 01 00    movsd  0x192(%rip),%xmm1   # 4007e0 <_IO_stdin_used+0x10>
  40064d: 00
  40064e: 66 90                   xchg   %ax,%ax
  400650: f2 0f 10 1c 07          movsd  (%rdi,%rax,1),%xmm3
  400655: 48 83 c0 08             add    $0x8,%rax
  400659: 48 3d a8 60 2f 0b       cmp    $0xb2f60a8,%rax
  40065f: f2 0f 59 d9             mulsd  %xmm1,%xmm3
  400663: f2 0f 59 c8             mulsd  %xmm0,%xmm1
  400667: f2 0f 58 d3             addsd  %xmm3,%xmm2
  40066b: 75 e3                   jne    400650 <poly+0x10>
  40066d: 66 0f 28 c2             movapd %xmm2,%xmm0
  400671: c3                      retq
  400672: 66 66 66 66 2e 0f       data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  400679: 1f 84 00 00 00 00
```
We can see that the poly function starts at address 0x400640, consistent with the output of the test program in the previous article. We now analyze the code corresponding to the loop statement, from 0x400650 to 0x40066b:
```
# for (int i = 0; i < n; i++, p *= x) result += a[i] * p;
# i in %rax, a in %rdi, x in %xmm0, p in %xmm1, result in %xmm2, z in %xmm3
400650: movsd (%rdi,%rax,1),%xmm3   # z = a[i]
400655: add   $0x8,%rax             # i++, for 8-byte double
400659: cmp   $0xb2f60a8,%rax       # compare n:i
40065f: mulsd %xmm1,%xmm3           # z *= p
400663: mulsd %xmm0,%xmm1           # p *= x
400667: addsd %xmm3,%xmm2           # result += z
40066b: jne   400650 <poly+0x10>    # if !=, goto loop
```
In the x86-64 architecture, %rax and %rdi are 64-bit integer registers, while %xmm0, %xmm1, %xmm2, and %xmm3 are 128-bit floating-point registers.
In this example:
- The integer loop variable i is stored in the %rax register.
- The address of the first element of the double-precision floating-point array a is stored in the %rdi register. Note that this address is a 64-bit pointer, an integer rather than a floating-point value.
- The double-precision input parameter x is stored in the %xmm0 register.
- The intermediate variable p is stored in the %xmm1 register.
- The final result is stored in the %xmm2 register.
- In addition, the GCC compiler uses a temporary variable z, which is stored in the %xmm3 register.
The meanings of the immediate values in the code above:
- 0x8: used in the add instruction to add 8 bytes (the size of a double) to %rax, implementing i++.
- 0xb2f60a8: used in the cmp instruction. It equals 187654312; divided by the 8-byte element size, that is 23456789, i.e., the value of n.
- 0x400650: used in the jne instruction to specify the jump target.
We can see that the performance-limiting computation here is the repeated evaluation of the expression p *= x: it requires a double-precision floating-point multiplication (5 clock cycles), and the multiplication of the next iteration cannot begin until the previous one has completed. Between two consecutive iterations we must also evaluate z *= p, which requires a double-precision floating-point multiplication (5 clock cycles), and result += z, which requires a double-precision floating-point addition (3 clock cycles). These three floating-point expressions can be computed simultaneously in the pipeline, so in the end each loop iteration takes 5 clock cycles.
In this assembly code, the C compiler makes full use of the instruction-level parallelism provided by the processor, executing multiple instructions at the same time to optimize program performance.
Analysis of the polyh Function
The following is the C-language source code of the polyh function:

```c
double polyh(double a[], double x) {
    double result = 0;
    for (int i = n - 1; i >= 0; i--)
        result = result * x + a[i];
    return result;
}
```
The corresponding assembly language code is as follows:
```
0000000000400680 <polyh>:
  400680: 66 0f 57 c9             xorpd  %xmm1,%xmm1
  400684: 31 c0                   xor    %eax,%eax
  400686: 66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40068d: 00 00 00
  400690: f2 0f 59 c8             mulsd  %xmm0,%xmm1
  400694: f2 0f 58 8c 07 a0 60    addsd  0xb2f60a0(%rdi,%rax,1),%xmm1
  40069b: 2f 0b
  40069d: 48 83 e8 08             sub    $0x8,%rax
  4006a1: 48 3d 58 9f d0 f4       cmp    $0xfffffffff4d09f58,%rax
  4006a7: 75 e7                   jne    400690 <polyh+0x10>
  4006a9: 66 0f 28 c1             movapd %xmm1,%xmm0
  4006ad: c3                      retq
  4006ae: 66 90                   xchg   %ax,%ax
```
Again we can see that the polyh function starts at address 0x400680, consistent with the output of the test program in the previous article. The loop statement to analyze lies between 0x400690 and 0x4006a7:
```
# for (int i = n-1; i >= 0; i--) result = result * x + a[i];
# i in %rax, a in %rdi, x in %xmm0, result in %xmm1
400690: mulsd %xmm0,%xmm1                  # result *= x
400694: addsd 0xb2f60a0(%rdi,%rax,1),%xmm1 # result += a[i]
40069d: sub   $0x8,%rax                    # i--, for 8-byte double
4006a1: cmp   $0xfffffffff4d09f58,%rax     # compare 0:i
4006a7: jne   400690 <polyh+0x10>          # if !=, goto loop
```
The meanings of the immediate values in the program above:
- 0x8: used in the sub instruction to subtract 8 bytes (the size of a double) from %rax, implementing i--.
- 0xb2f60a0: used in the addsd instruction. It equals 187654304; divided by the 8-byte element size, that is 23456788, i.e., the value of n-1.
- 0xfffffffff4d09f58: used in the cmp instruction. It is the 64-bit two's-complement representation of -0xb2f60a8, i.e., -(0xb2f60a0 + 0x8); %rax reaches this value after the iteration with i = 0, so the loop exits.
- 0x400690: used in the jne instruction to specify the jump target.
Similarly:
- The integer loop variable i is stored in the %rax register.
- The address of the first element of the double-precision floating-point array a is stored in the %rdi register.
- The double-precision input parameter x is stored in the %xmm0 register.
- The final result is stored in the %xmm1 register.
We can see that the performance-limiting computations here are the expressions result *= x and result += a[i]. Starting from the value of result produced by the previous iteration, we must first multiply it by x (5 clock cycles), then add a[i] to it (3 clock cycles) to obtain this iteration's value. Each loop iteration therefore takes 8 clock cycles, slower than the 5 clock cycles of the original algorithm. Note that because result += a[i] needs the value produced by result *= x, the two expressions cannot be computed simultaneously in the pipeline. Because of this data dependency, the instruction-level parallelism provided by the processor cannot be exploited to optimize the program's performance.
Conclusion
Optimizing program performance is not a simple task; it requires understanding the core concepts of computer systems. Modern processors use complex techniques to process machine-level programs, executing many instructions in parallel, possibly in an order different from the one in which they appear in the program. Programmers must understand how these processors work to tune their programs for maximum speed. We strongly recommend a thorough reading of Computer Systems: A Programmer's Perspective (2nd Edition).
References
- Computer Systems: A Programmer's Perspective (2nd Edition)
- Wikipedia: Instruction-level parallelism
- Wikipedia: Data dependency