1. Compiler optimization techniques are divided into "machine-independent" and "machine-dependent". "Machine-independent" means the techniques can be applied without considering the machine that will execute the code. "Machine-dependent" means the techniques depend on many low-level details of the machine.
2. Least-squares fitting
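3. Optimization: eliminate loop inefficiencies
A computation that is repeated on every iteration of a loop, but whose result never changes, can be moved out of the loop. This is called code motion.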
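As a minimal sketch of code motion, the two functions below convert a string to lowercase. The names lower_slow and lower_fast are hypothetical and chosen only for illustration. The first calls strlen in the loop condition on every iteration. The second computes the length once before the loop, which is safe because the loop body never changes the length of the string.

```c
#include <stddef.h>
#include <string.h>

/* Before: strlen(s) is evaluated again on every iteration,
   although the length of s does not change inside the loop. */
void lower_slow(char *s)
{
    for (size_t i = 0; i < strlen(s); i++) {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
    }
}

/* After: the loop-invariant call is moved out of the loop. */
void lower_fast(char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
    }
}
```

Both functions produce the same result. The difference only becomes visible on long strings, where the repeated strlen calls make the first version's running time grow quadratically with the string length.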
4. Reduce unnecessary procedure calls. If every access is already guaranteed to be within bounds, the bounds check does not need to be performed on each access.
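The idea can be sketched as follows, assuming a hypothetical vector type vec_t with a bounds-checked accessor vec_get (neither comes from the original notes). In sum_checked every element goes through the accessor, so each access costs a call and a check. In sum_direct the loop bound already guarantees that the index is valid, so the data is read directly.

```c
#include <stddef.h>

/* Hypothetical vector type, for illustration only. */
typedef struct {
    size_t len;
    int *data;
} vec_t;

/* Bounds-checked accessor: returns 1 on success, 0 if i is out of range. */
int vec_get(const vec_t *v, size_t i, int *out)
{
    if (i >= v->len)
        return 0;
    *out = v->data[i];
    return 1;
}

/* One procedure call and one bounds check per element. */
int sum_checked(const vec_t *v)
{
    int sum = 0;
    for (size_t i = 0; i < v->len; i++) {
        int x;
        if (vec_get(v, i, &x))
            sum += x;
    }
    return sum;
}

/* The loop condition i < len already guarantees a valid index,
   so the per-element call and check are dropped. */
int sum_direct(const vec_t *v)
{
    const int *data = v->data;
    size_t len = v->len;
    int sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}
```

The direct version trades some modularity for speed, since it depends on how vec_t is laid out internally.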
5. Eliminate unnecessary memory references
Sample code:

void test1(int *ptoint)
{
    *ptoint = 0;
    for (int i = 0; i < 10; i++) {
        *ptoint += i;    /* reads and writes memory on every iteration */
    }
}

/* With large amounts of data, the version above is much slower than the
   one below, because every access to *ptoint is a memory reference. */

void test2(int *ptoint2)
{
    int itemp = 0;       /* the intermediate result stays in a local variable */
    for (int i = 0; i < 10; i++) {
        itemp += i;
    }
    *ptoint2 = itemp;    /* a single memory write at the end */
}
Introduce a temporary variable to hold the intermediate result, and store the result in the array or global variable only once the final value has been computed. With optimization enabled, the compiler keeps the intermediate value in a register (usually eax). This can be seen in the generated assembly code.
6. Modern processor structure
Superscalar: the processor can execute multiple operations in each clock cycle, and it can execute them out of order. Instructions do not have to execute in the order in which they appear in the assembly program.
The overall design has two main parts: the ICU (Instruction Control Unit) and the EU (Execution Unit). The former reads instruction sequences from memory and generates from them a set of basic operations on the program data. The latter performs these operations.
The retirement unit keeps track of ongoing processing and ensures that it obeys the sequential semantics of the machine-level program.
7. Most units can start a new operation in every clock cycle. The only exceptions are the floating-point multiplier and the two dividers. The dividers are not pipelined.
Latency is the total number of cycles required to perform a single operation.
Issue time is the number of cycles between successive, independent operations. (These figures are taken from Intel literature.)
8. Reduce loop overhead
We can reduce the impact of loop overhead by performing more data operations in each iteration. This is called loop unrolling. The idea is to access several array elements and perform their multiplications within a single iteration.
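A minimal sketch of loop unrolling, assuming the task is to multiply together all elements of an array (the function names and the unrolling factor of 3 are illustrative choices, not taken from the original notes):

```c
#include <stddef.h>

/* Plain loop: one multiplication per iteration, so the loop test and
   index update are paid once per element. */
long product_simple(const long *a, size_t n)
{
    long p = 1;
    for (size_t i = 0; i < n; i++)
        p *= a[i];
    return p;
}

/* Unrolled by 3: three multiplications per iteration, so the loop test
   and index update are paid once per three elements. */
long product_unrolled(const long *a, size_t n)
{
    long p = 1;
    size_t i = 0;

    /* Main loop: runs while at least three elements remain. */
    for (; i + 2 < n; i += 3)
        p = p * a[i] * a[i + 1] * a[i + 2];

    /* Cleanup loop: handles the remaining 0 to 2 elements. */
    for (; i < n; i++)
        p *= a[i];

    return p;
}
```

The cleanup loop is needed whenever the array length is not a multiple of the unrolling factor. Whether unrolling pays off depends on the compiler and the processor, so it is worth measuring.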
9. On IA32 processors, all floating-point operations are performed in 80-bit extended precision, and the floating-point registers hold values in this format. A register value is converted to the 32-bit (float) or 64-bit (double) format only when it is written to memory.
10. Performance improvement
1) Choose appropriate data structures and algorithms.
2) Coding:
(1) Watch for the sources of poor performance listed in the items above.
11. Program profiling
Profiling involves running a version of the program into which instrumentation code has been inserted, in order to determine how much time each part of the program takes.
Unix systems provide a profiling program, GPROF. For more details, search Google or refer to Chapter 5.
12. Amdahl's law
The main idea is: when we speed up one part of a system, the effect on the overall performance of the system depends on how significant that part was (the fraction of total time it took) and on how much it was sped up (how many times faster it became).
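In the usual formulation, if a part that takes a fraction a of the total time is made k times faster, the overall speedup is S = 1 / ((1 - a) + a / k). The sketch below computes this value. The function name amdahl_speedup and its parameter names are illustrative assumptions.

```c
/* Overall speedup according to Amdahl's law.
   alpha: fraction of the original running time spent in the improved
          part, between 0 and 1
   k:     factor by which that part was sped up, greater than 0 */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}
```

For example, speeding up a part that takes 60% of the time by a factor of 3 gives 1 / (0.4 + 0.2), about 1.67, for the whole system. Even a large improvement to one part yields a modest overall gain unless that part dominates the running time.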
Computer Systems: A Programmer's Perspective