Program optimization is the process of adjusting and improving a program, with the help of software development tools, after coding is complete, so that the program makes full use of resources, runs more efficiently, and occupies less space. Depending on the optimization focus, program optimization can be divided into running speed optimization and code size optimization. Running speed optimization means reducing the number of instructions needed to complete a given task, for example by restructuring the program, on the basis of a thorough understanding of the hardware and software characteristics; on the same processor, a speed-optimized program completes the task in less time than an unoptimized one, i.e. it runs more efficiently. Code size optimization means taking measures to minimize the amount of code the application needs while still performing its functions correctly.
In practice, however, the two goals of program optimization (running speed and code size) are usually in conflict. Improving running efficiency often requires sacrificing storage space and increasing code size; for example, two techniques commonly used in program design, replacing calculation with look-up tables and loop unrolling, both tend to enlarge the program. Conversely, reducing code size to compress memory use may come at the cost of running efficiency. Therefore, before optimizing a program, choose a policy based on actual needs: when processor resources are scarce, give priority to speed optimization; when memory resources are limited, give priority to code size optimization.
1. Program Running Speed Optimization
The methods for optimizing program running speed can be divided into the following categories.
1.1 General Optimization Methods
(1) Reduce operation strength
Replace multiplication/division by a power of 2 with shifts: multiplying or dividing by 2^n can generally be done by shifting left or right by n bits. In fact, multiplication by any integer constant can be replaced by a combination of shifts and additions. On the ARM7, a shift combined with an add can be completed in a single instruction, which takes less time than a multiply instruction. For example, i = i * 5 can be replaced by i = (i << 2) + i.
Replace exponentiation with multiplication: a 32 x 8 multiplier is built into the ARM7 core, so multiplication can be used instead of a call to the pow() library function, saving the function-call overhead. For example, i = pow(i, 3.0) can be replaced by i = i * i * i.
Replace the remainder operation with AND: when the right operand is a power of 2, the remainder operation (%) can sometimes be replaced by an AND instruction to improve efficiency. For example, for non-negative i, i = i % 8 can be replaced by i = i & 0x07.
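As a sketch, the three substitutions above can be written as small C helpers (the function names are illustrative, and the masked remainder form assumes a non-negative operand):

```c
/* Strength-reduction sketches: each function computes the same value
   as the expression in the trailing comment, without a multiply
   instruction, a pow() call, or a modulo operation respectively. */
int times5(int i) { return (i << 2) + i; } /* i * 5          */
int cube(int i)   { return i * i * i;    } /* (int)pow(i, 3) */
int mod8(int i)   { return i & 0x07;     } /* i % 8, i >= 0  */
```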
(2) Optimize loop termination conditions
In a loop structure, the termination condition strongly affects loop efficiency. Combined with the conditional-execution feature of ARM instructions, loops should therefore be written with a count-down-to-zero termination condition whenever possible. The compiler can then use a single BNE (branch if not zero) instruction in place of a CMP (compare) plus BLE (branch if less than or equal) pair, which both reduces code size and speeds up execution.
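As an illustration, a hypothetical byte-sum routine written with a count-down-to-zero loop might look like this; the decrement itself sets the condition flags the loop test needs, so no separate compare against an upper limit is required:

```c
/* Hypothetical example: counting n down to zero lets the compiler
   test the loop condition using the flags set by the decrement,
   instead of comparing the counter against a limit each iteration. */
unsigned int byte_sum(const unsigned char *data, unsigned int n)
{
    unsigned int sum = 0;
    while (n-- != 0)        /* count down to zero */
        sum += *data++;
    return sum;
}
```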
(3) Use inline functions
ARM C supports the inline keyword. If a function is declared inline, its body is substituted at each place it is called, which completely eliminates the function-call overhead. The biggest drawback of inline is that when the function is called frequently, code size increases.
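A minimal sketch of an inline candidate (the function and its use are illustrative):

```c
/* A small, frequently called function is a good inline candidate:
   the compiler substitutes the body at each call site, removing the
   call/return overhead at the cost of duplicating the body. */
static inline int square(int x)
{
    return x * x;
}
```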
1.2 Processor-Related Optimization Methods
(1) Keep the pipeline flowing
As introduced earlier, pipeline stalls and blockages degrade processor performance, so the pipeline should be kept flowing as much as possible. Pipeline stalls are hard to avoid entirely, but the stall cycles can be put to use by other operations.
The auto-indexing feature of the load/store instructions is designed to exploit pipeline stall cycles. While the pipeline is stalled, the processor's execution unit is occupied, but the arithmetic logic unit (ALU) and the barrel shifter may be idle; they can then be used to add an offset to the base register for use by subsequent instructions. For example, the instruction LDR R1, [R2], #4 performs R1 = *R2 followed by R2 += 4; this is post-indexing. The instruction LDR R1, [R2, #4]! performs R1 = *(R2 + 4) followed by R2 += 4; this is pre-indexing.
Pipeline blockage can be mitigated by loop unrolling. To reduce the proportion of branch instructions among the instructions in a loop, consider unrolling the loop body to improve code efficiency. Consider the following memory copy function.
void memcopy(char *to, char *from, unsigned int nbytes)
{
    while (nbytes--)
        *to++ = *from++;
}
For simplicity, assume that nbytes is a multiple of 16 (handling of the remainder is omitted). The function above must perform a test and a branch for every single byte it processes. The loop body can be unrolled as follows:
void memcopy(char *to, char *from, unsigned int nbytes)
{
    while (nbytes) {
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        nbytes -= 4;
    }
}
Now there are more instructions in the loop body but fewer loop iterations, so the negative effect of the branch instruction is weakened. Taking advantage of the ARM7 processor's 32-bit word length, the code can be adjusted further:
void memcopy(char *to, char *from, unsigned int nbytes)
{
    int *p_to   = (int *) to;
    int *p_from = (int *) from;
    while (nbytes) {
        *p_to++ = *p_from++;
        *p_to++ = *p_from++;
        *p_to++ = *p_from++;
        *p_to++ = *p_from++;
        nbytes -= 16;
    }
}
After this optimization, each loop iteration processes 16 bytes, and the impact of the branch instruction is weakened further. Note, however, that the adjusted code has grown in size.
(2) Use register variables
The CPU accesses registers much faster than memory, so allocating registers to variables helps optimize the code and improves running efficiency. Registers can be allocated to variables of integer, pointer, and floating-point types, and part or all of a structure can also be placed in registers. Allocating registers to the variables accessed most frequently inside loop bodies can improve program efficiency to a certain extent.
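A sketch using the register storage class (note that modern compilers generally make register-allocation decisions themselves, so this is only a hint):

```c
/* Hinting that the accumulator and loop counter, the most frequently
   accessed variables in the loop body, should live in registers. */
int sum_array(const int *a, int n)
{
    register int sum = 0;
    register int i;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```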
1.3 Instruction-Set-Related Optimization Methods
Sometimes the program can be optimized based on the characteristics of the ARM7 instruction set.
(1) Avoid Division
The ARM7 instruction set has no divide instruction; division is implemented by calling a C library function, and a 32-bit division typically takes 20~140 clock cycles. Division is therefore a bottleneck for program efficiency and should be avoided as much as possible. Some divisions can be replaced by multiplications: for example, if (x / y) > z can be changed to if (x > y * z) (assuming y is positive and the product does not overflow). Where precision requirements allow and memory is to spare, a look-up table can also replace division. And when the divisor is a power of 2, the division can be replaced by a shift operation.
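Two of these substitutions, sketched in C (function names are illustrative). One subtlety: for truncating integer division the exact equivalent of (x / y) > z is x >= y * (z + 1); the simpler x > y * z form differs only when x lies strictly between y*z and y*(z+1):

```c
/* Avoiding division on a core with no divide instruction. Both
   functions assume y > 0 and that the multiplication cannot overflow. */
unsigned ratio_exceeds(unsigned x, unsigned y, unsigned z)
{
    return x >= y * (z + 1);   /* same result as (x / y) > z */
}

unsigned div16(unsigned x)
{
    return x >> 4;             /* same result as x / 16 for unsigned x */
}
```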
(2) Conditional execution
An important feature of the ARM instruction set is that every instruction can carry an optional condition code. An instruction with a condition code executes only when the condition flags in the Program Status Register (PSR) satisfy the specified condition. Conditional execution usually removes the need for separate test-and-branch instructions, which reduces code size and improves program efficiency.
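The classic illustration is Euclid's greatest-common-divisor loop: an ARM compiler can translate the if/else inside the loop into conditionally executed subtract instructions (e.g. SUBGT/SUBLT), leaving no branch in the loop body apart from the loop branch itself:

```c
/* Euclid's GCD by repeated subtraction, for positive a and b.
   The two conditional subtractions are natural candidates for ARM
   conditional execution. */
int gcd(int a, int b)
{
    while (a != b) {
        if (a > b)
            a -= b;     /* may compile to SUBGT a, a, b */
        else
            b -= a;     /* may compile to SUBLT b, b, a */
    }
    return a;
}
```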
(3) Use appropriate variable types
The ARM instruction set supports signed/unsigned 8-bit, 16-bit, and 32-bit integer variables as well as floating-point variables. Using variable types appropriately not only saves code but also improves running efficiency. Avoid char and short local variables where possible, because operations on 8-bit/16-bit local variables often require more instructions than the same operations on 32-bit variables. Compare the following three functions and their compiled assembly code.
int wordinc(int a)            wordinc
{                                 ADD  a1, a1, #1
    return a + 1;                 MOV  pc, lr
}

short shortinc(short a)       shortinc
{                                 ADD  a1, a1, #1
    return a + 1;                 MOV  a1, a1, LSL #16
}                                 MOV  a1, a1, ASR #16
                                  MOV  pc, lr

char charinc(char a)          charinc
{                                 ADD  a1, a1, #1
    return a + 1;                 AND  a1, a1, #&ff
}                                 MOV  pc, lr
As can be seen, operating on 32-bit variables requires fewer instructions than operating on 8-bit or 16-bit variables.
1.4 Memory-Related Optimization Methods
(1) Replace calculation with look-up tables
When the CPU resources are tight and the memory resources are relatively rich, you can sacrifice the storage space in exchange for the running speed. For example, if you need to frequently calculate the sine or cosine function value, you can pre-calculate the function value and place it in the memory for future search.
(2) Make full use of the on-chip RAM
Some manufacturers produce ARM chips with a certain amount of on-chip RAM, such as Atmel's AT91R40807 and Sharp's LH75400/LH75401 with 32 KB of RAM. The processor accesses on-chip RAM faster than external RAM, so programs should be run from on-chip RAM whenever possible. If a program is too large to fit entirely in on-chip RAM, consider moving the most frequently used data or program segments into on-chip RAM to improve running efficiency.
1.5 Compiler-Related Optimization Methods
Most compilers support optimizing for program speed or program size, and some also let the user select what to optimize and to what degree. Compared with the preceding methods, setting compiler options appropriately is a simple and effective way to optimize a program.
2. Code Size Optimization
An important feature of reduced instruction set computers is their fixed instruction length. This simplifies instruction decoding but tends to increase code size. To counter this, the following measures can be considered to reduce the amount of code.
2.1 Use Multi-Register Transfer Instructions
The multi-register transfer instructions LDM/STM in the ARM instruction set can load/store several registers at once, which is very effective for saving/restoring the state of the register set and for copying large blocks of data. For example, storing the contents of R4~R12 and R14 on the stack would take ten STR instructions, while the single instruction STMEA R13!, {R4-R12, R14} achieves the same purpose, saving a considerable amount of instruction storage. However, although one LDM/STM instruction can replace several LDR/STR instructions, this does not mean the program runs faster: the processor still splits the LDM/STM into multiple separate load/store transfers when executing it.
2.2 Arrange Variable Order Reasonably
The ARM7 processor requires 32-bit/16-bit variables to be word/half-word aligned, which means that an unreasonable variable order can waste storage space. For example, suppose a struct contains four 32-bit int variables i1~i4 and four 8-bit char variables c1~c4. If they are stored interleaved in the order i1, c1, i2, c2, i3, c3, i4, c4, the alignment of the integer variables causes each 8-bit char sandwiched between two ints to occupy a full 32 bits of memory, wasting storage space. To avoid this, the int and char variables should be stored contiguously by type, in an order such as i1, i2, i3, i4, c1, c2, c3, c4.
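The effect can be checked with sizeof. On a typical ABI with 4-byte, word-aligned int (as on ARM), the interleaved layout pads each char out to a full word:

```c
/* Two layouts of the same eight members. With 4-byte alignment for
   int, the interleaved struct typically occupies 32 bytes and the
   grouped struct 20 bytes. */
struct interleaved {
    int  i1; char c1;
    int  i2; char c2;
    int  i3; char c3;
    int  i4; char c4;
};

struct grouped {
    int  i1, i2, i3, i4;
    char c1, c2, c3, c4;
};
```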
2.3 Use Thumb Instructions
To reduce code size at the root, ARM developed the 16-bit Thumb instruction set. Thumb is an extension of the ARM architecture: it re-encodes a subset of the most commonly used 32-bit ARM instructions as 16-bit-wide instructions. During execution, the 16-bit instructions are transparently decompressed in real time into 32-bit ARM instructions without performance loss, and there is no overhead for switching between Thumb state and ARM state. Compared with equivalent 32-bit ARM code, Thumb code can save more than 35% of memory space.
Conclusion
To sum up, optimization is a process of making full use of hardware resources and continually adjusting the program structure to make it more reasonable, on the basis of a thorough understanding of the software and hardware. The goal is to bring out the processor's full performance and make the best use of resources, so that programs perform as well as possible on a given hardware platform. As ARM processors become widely used in communications, consumer electronics, and other industries, optimization techniques will play an increasingly important role in ARM-based program design.
It is worth noting that optimization is usually only one of the goals of software design. Optimization should be carried out without compromising the program's correctness, robustness, portability, or maintainability. A one-sided pursuit of optimization often harms robustness, portability, and other equally important goals.