On ARM's C code optimization in embedded development

Last Update:2018-07-29 Source: Internet

Author: User

Tags arithmetic integer division pow switches

July 28, 2008 22:22:02
The following is a collection of C code optimization methods for ARM, gathered from around the web, that should be useful in embedded development:
[Statement: The following methods are not my own findings or summaries; they are the selfless contributions of others. Thanks to them for their work and for sharing.]

=======================================================
C Data Type
1. C program optimization depends on the compiler and the hardware, and setting compiler options is the simplest and most direct way to optimize. By default, armcc enables all optimizations, while the GNU compiler defaults to no optimization. In the ARM C compiler the char type is 8-bit unsigned, unlike many other popular compilers, where char defaults to 8-bit signed. As a result, a loop that uses a char variable i with the condition i >= 0 never terminates. To avoid this, make char signed with -fsigned-char (for GCC) or -zc (for armcc).
The data types are as follows:
char       unsigned 8-bit byte
short      signed 16-bit halfword
int        signed 32-bit word
long       signed 32-bit word
long long  signed 64-bit doubleword
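As a rough illustration of the pitfall in item 1, the sketch below declares the counter as unsigned char explicitly, so it behaves the same whatever the compiler's default for plain char is. The function names are hypothetical and do not come from the original article.

```c
/* Hypothetical helper: returns the value an 8-bit unsigned counter holds
   after it is decremented from 0. It wraps to 255 instead of becoming -1,
   which is why a test such as "i >= 0" is always true for such a counter. */
unsigned int unsigned_char_below_zero(void)
{
    unsigned char i = 0;
    i--;                      /* wraps modulo 256 with an 8-bit char */
    return i;
}

/* With an int counter the same count-down loop terminates as expected. */
int sum_down_from(int start)
{
    int sum = 0;
    int i;
    for (i = start; i >= 0; i--) {
        sum += i;
    }
    return sum;
}
```

Replacing the int counter in sum_down_from with an unsigned char would turn it into an endless loop, which is the situation the compiler options above are meant to avoid.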
2. About Local variables
Most ARM data processing operations are 32-bit, so local variables should use a 32-bit data type (int or long) whenever possible, even when they hold 8-bit or 16-bit values. Avoid char and short for local variables unless you rely on their wrap-around behavior (for example 255 + 1 = 0 for an unsigned char, used for modulo arithmetic). Otherwise, the compiler has to add code to handle values that exceed the range of short or char.
Also, be careful with the types of expressions, as in the following example:
short checksum_v3(short *data) {
    unsigned int i;
    short sum = 0;
    for (i = 0; i < 64; i++) {
        /* The operands are promoted, so short + short yields an int
           (whereas int + int stays an int). Be careful with type
           conversions when handling data that is not 32-bit: the result
           has to be cast back to short explicitly. */
        sum = (short)(sum + data[i]);
    }
    return sum;
}
Also, as the example shows, every iteration of the loop body performs a type conversion, which reduces efficiency. The sum can instead be accumulated as an int and converted to short only once, just before returning.
In addition, because data[] is an array of short, it is accessed with the LDRH instruction, which cannot use the barrel shifter in its addressing mode. The offset therefore has to be computed in a separate instruction before the load, which also hurts performance. The workaround is to use a pointer instead of array indexing, as follows:
short checksum_v4(short *data) {
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++) {
        sum += *(data++);   /* accumulate as int, advance the pointer */
    }
    return (short)sum;      /* convert once, on return */
}
3. About function parameter types
Function arguments and return values should use the int type as much as possible.
In addition, for global variables that are accessed infrequently, use small data types to save space.

C Loop Structure
Use a count-down-to-zero loop to save instructions and registers.
Use an unsigned loop counter and the termination condition i != 0.
If the loop body is executed at least once, prefer a do-while loop.
Unroll the loop body where appropriate.
Where possible, make array sizes a multiple of 4 or 8, and unroll the loop by that multiple.
Register allocation
Try to limit the number of local variables used in the inner loop of a function to at most 12, so that the compiler can assign all of them to registers.
The compiler can be guided as to the importance of a variable by whether it is used in the inner loop.
Function call
On ARM, the first four integer arguments of a function are passed in registers r0, r1, r2, and r3; any further integer arguments are passed on the stack (a full descending stack).
Limit functions to no more than four parameters where possible; related parameters can also be grouped into a structure and passed together.
Place small called functions in the same source file as their callers, and define them before they are called, so that the compiler can optimize the calls.
Use __inline for important functions that have a large effect on performance.
Pointer alias
Use a local variable to hold the value of a common subexpression, so that the expression is evaluated only once.
Avoid taking the address of a local variable; otherwise access to that variable becomes less efficient.
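The following sketch illustrates the pointer alias advice. The function names are made up for this example, and whether a given compiler actually reloads the value in the first version depends on the compiler and its options.

```c
/* The compiler must assume that out[] and *step may overlap, so *step may
   be reloaded from memory on every iteration. */
void add_step_reload(int *out, const int *step, unsigned int n)
{
    unsigned int i;
    for (i = 0; i < n; i++) {
        out[i] += *step;
    }
}

/* Reading *step once into a local variable lets the value stay in a
   register. Note: this changes the result if step really does point
   into out[], so it is only valid when the two do not overlap. */
void add_step_cached(int *out, const int *step, unsigned int n)
{
    int s = *step;
    unsigned int i;
    for (i = 0; i < n; i++) {
        out[i] += s;
    }
}
```

When the pointers do not overlap, both functions produce the same result; the second simply states that assumption in the code instead of leaving the compiler to be conservative.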
Structure arrangement
Place small elements at the beginning of a structure and large elements at the end.
Avoid very large structures; replace them with a hierarchy of smaller structures.
Manually add padding to structures used in an API to improve portability.
Use enumeration types with caution, because their size depends on the compiler.
Bit fields
Prefer masks defined with #define or enum to bit fields.
Use logical operations in place of bit-field operations.
Unaligned data and endianness
Avoid unaligned data as far as possible.
A char * can point to data at any byte alignment; combined with logical operations, it can be used to access data of any alignment and byte order.
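A minimal sketch of the last point: reading a 32-bit value through a byte pointer with shifts and ORs works at any address and makes the byte order explicit. The function names are hypothetical.

```c
/* Reads a 32-bit little-endian value from any byte address. Only byte
   loads are used, so no alignment is required. */
unsigned long read_u32_le(const unsigned char *p)
{
    return  (unsigned long)p[0]
         | ((unsigned long)p[1] << 8)
         | ((unsigned long)p[2] << 16)
         | ((unsigned long)p[3] << 24);
}

/* Same idea for big-endian data: the most significant byte comes first. */
unsigned long read_u32_be(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24)
         | ((unsigned long)p[1] << 16)
         | ((unsigned long)p[2] << 8)
         |  (unsigned long)p[3];
}
```

The cost is four byte loads instead of one word load, so this is best kept for data whose alignment or byte order really is unknown.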
Division
There are many algorithms here and they are not easy to summarize; in general, replace division with other operations such as shifts.
Inline functions and inline assembly
There is not much to add: inlining reduces call overhead, and inline assembly improves operational efficiency.
Summary
In general, optimization of high-level language code depends on the compiler and the hardware architecture.
On the hardware side, ARM generally has a 32-bit bus, so data is accessed fastest in 32-bit units. Local variables and other frequently used variables should use the 32-bit int type, and when laying out a structure, pay attention to the order of the elements (small before large) to save space. In addition, because ARM instructions can be executed conditionally, making full use of the CPSR flags makes programs more efficient. Also pay attention to the types involved in each operation and minimize conversions. A division routine yields the quotient and the remainder at the same time, so obtaining both costs nothing extra; division on its own, however, is expensive and is better replaced by multiplication where possible.
On the compiler side, armcc follows the ATPCS: the first four arguments are passed in r0~r3, the remaining arguments are passed on the stack, and the return value is passed in r0. To keep most operations in registers, it is therefore best to have no more than 4 parameters. In addition, about 12 general-purpose registers are available, so keeping the number of local variables within 12 improves efficiency. Finally, because the compiler is conservative, pointer aliases cause redundant loads, so use them sparingly.
=====================================================

• Data type
o Local variables held in registers (especially loop variables) should use the 32-bit int type (the same size as long) whenever possible; 8-bit variables save neither space nor time;
o Even when passing 8-bit data, it is more efficient to use 32-bit types for function parameters and return values;
o Prefer incrementing a pointer to indexing an array with an incrementing index: a = data[i++] is less efficient than a = *(data++);
o Division is faster with unsigned operands;
o For arrays and global variables stored in memory, use the smallest suitable data type;
o For short arrays, avoid addressing by base address plus index, because the LDRH instruction does not support shifted register offsets;
o Use explicit type conversions when assigning between memory variables and register variables, and avoid unnecessary type conversions elsewhere;
• Loop structure
o A count-down loop is better than a count-up loop; write the termination condition as i != 0 where possible. When the starting value of the loop counter is a variable known to be nonzero, a do-while loop is better (the termination test comes at the end);
o If the loop body is very simple, for example fewer than 4 cycles, unroll it (write out the body several times), so that the loop overhead does not cost more cycles than the body itself;
o Try to limit the number of local variables used in the inner loop of a function to no more than 12, so that the compiler can assign them all to ARM registers;
• Function call
o Limit the number of function parameters to no more than 4 where possible. Several related parameters can be grouped into a structure;
o Put small called functions in the same file as their callers, and define them before they are called;
o Important functions with a large effect on performance can be inlined using __inline;
• Pointer alias
o Create a new local variable to hold an expression that contains a memory access, which guarantees that the expression is evaluated only once. For example, int a = data[n]; b += a; c += a; is better than b += data[n]; c += data[n];
o Avoid taking the address of a local variable; otherwise access to that variable is less efficient;
• Structure arrangement
o Order structure elements from smallest to largest;
o Avoid very large structures; replace them with a hierarchy of small structures;
Note: this applies to ARMv4 and later.
======================================================
Variable definition
The instruction set of the 32-bit ARM processor supports signed/unsigned 8-bit, 16-bit, and 32-bit integer and floating-point variable types. Choosing variable types sensibly not only saves code but also improves its efficiency. By scope, C variables can be divided into global variables and local variables. The ARM compiler usually places global variables in memory and assigns local variables to general-purpose registers.

When declaring global variables, consider the memory layout so that variables of the various types can be aligned on 32-bit boundaries, reducing wasted storage and improving running efficiency. For example:

The four variables defined in each form are the same, but they are declared in a different order, which results in different data layouts in the final image, as shown in Figure 1. Clearly the second form uses less memory.

Figure 1 Layout of variables in the data area

For local variables, try not to use types other than 32-bit ones. When a function has only a few local variables, the compiler assigns them to internal registers, each variable occupying one 32-bit register. Variables of type short and char therefore not only fail to save space but also take more instruction cycles, because extra work is needed to access short and char values. The C code and its compiled output are as follows:

Conditional execution
Conditional execution is an essential basic operation in a program. A typical conditional code sequence begins with a comparison instruction followed by a series of dependent statements. On ARM, conditional execution is driven by the condition flags, and the N and Z flags set by a data processing operation are the same as those a comparison of its result with 0 would produce. C has no way to set the flags directly, but in C programs for ARM, when the result of an operation is compared with 0, the compiler removes the comparison instruction and performs both the operation and the test with a single flag-setting instruction. For example:

Therefore, conditions in C programs for ARM should take the form "compare with 0" as far as possible. In C, conditional execution mostly arises in if conditions, and also in compound relational (<, ==, >) and logical (&&, !, and so on) expressions. When programming in C for ARM, signed variables should use the relational tests x < 0, x >= 0, x == 0, and x != 0 wherever possible; unsigned variables should use x == 0 and x != 0 (or x > 0). The compiler can then optimize conditional execution.

The conditions in if and else statements should be kept as simple as possible. Unlike conventional C programming, in C programs for ARM similar conditions should be grouped together so that the compiler can optimize the tests.

Loops
Loops are a very common structure in programming. In embedded systems a large share of the processor's execution time is spent in loops, so their efficiency deserves attention. Besides keeping the core loop body as simple as correct operation allows, it is also important to choose a correct and efficient loop termination condition. Following the "compare with 0" principle above, loops should count down to zero, with a termination condition that is as simple as possible. Using this form in critical loops lets them omit unnecessary comparison instructions, which reduces overhead and improves performance. Consider the following two examples:

Fact1 and Fact2 both reduce load/store operations on n by introducing a local variable a. Fact2 follows the "compare with 0" principle, which eliminates the comparison instruction present in the compiled Fact1; in addition, the variable n is not used anywhere inside the loop and so does not need to be kept. The register this frees is available to the rest of the program, which improves running efficiency.

The count-down-to-zero approach also works with while and do statements. If a loop runs only a few times, it can be fully unrolled to improve efficiency. Once the loop is unrolled, the loop counter and the associated branch are no longer needed; the code gets longer, but it runs faster.
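The Fact1 and Fact2 listings referred to above are not reproduced in this copy of the article. The sketch below is only a guess at the kind of pair being compared: a count-up factorial and a count-down-to-zero one. The names fact_up and fact_down are hypothetical, and the exact instructions generated depend on the compiler.

```c
/* Count-up version: each iteration needs an explicit comparison of the
   counter i against n, and n must stay live for the whole loop. */
unsigned int fact_up(unsigned int n)
{
    unsigned int i;
    unsigned int a = 1;
    for (i = 1; i <= n; i++) {
        a *= i;
    }
    return a;
}

/* Count-down version: the loop ends when the counter reaches 0, so the
   decrement itself can set the flags the branch tests, and n is not
   needed inside the loop. */
unsigned int fact_down(unsigned int n)
{
    unsigned int i;
    unsigned int a = 1;
    for (i = n; i != 0; i--) {
        a *= i;
    }
    return a;
}
```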
Division and remainder
The ARM instruction set does not provide integer division; division is implemented by functions in the C library (__rt_sdiv for signed and __rt_udiv for unsigned). A 32-bit division takes 20 to 140 cycles, depending on the values of the numerator and denominator. The time taken is a constant plus a time per bit of the quotient:

Time(numerator / denominator) = C0 + C1 x log2(numerator / denominator)
                              = C0 + C1 x (log2(numerator) - log2(denominator))

Since division takes many cycles and considerable resources, it should be avoided in program design. Here are some ways to avoid calling the division routines:

(1) In some cases, division can be rewritten as multiplication. For example, (x / y) > z, where y is known to be positive and y x z fits in an integer, can be written as x > (z x y). Note that with truncating integer division the two forms can differ when x is not a multiple of y (for example x = 7, y = 2, z = 3), so check that the rewritten test is the one you intend.

(2) Use powers of 2 as divisors wherever possible, so that the compiler can perform the division with a shift; for example, 128 is a better choice than 100. Unsigned division is also faster than signed division.

(3) One use of the remainder operator is modulo arithmetic, which can sometimes be done with an if statement instead. Consider the following two functions:

uint counter1(uint count)
{
    return (++count % 60);
}

uint counter2(uint count)
{
    if (++count >= 60)
        count = 0;
    return (count);
}

(4) For some special division and remainder operations, a lookup table can also give good results.

When dividing by certain constants, a purpose-written function is much more efficient than the code the compiler generates. ARM's C library has two such functions, for signed and unsigned division by 10, for fast processing of decimal numbers. The examples\explasm\div.c and examples\thumb\div.c files in the toolkit subdirectory contain ARM and Thumb versions of these two functions.
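To give an idea of what a constant-divisor routine can look like, here is a sketch of unsigned division by 10 using a multiply by a fixed reciprocal and a shift. This is not the ARM library routine mentioned above; it is a common alternative technique, and it assumes a compiler with a 64-bit integer type (and, to be worthwhile, a core with a long multiply instruction). The names udiv10 and urem10 are hypothetical.

```c
#include <stdint.h>

/* 0xCCCCCCCD is 2^35 / 10 rounded up. Multiplying by it and shifting the
   64-bit product right by 35 gives the exact quotient n / 10 for every
   32-bit unsigned n. */
uint32_t udiv10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* The remainder follows from the quotient without a second division. */
uint32_t urem10(uint32_t n)
{
    return n - udiv10(n) * 10u;
}
```

A typical use is converting a number to decimal digits: call udiv10 and urem10 repeatedly, one digit per step.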
=======================================================
1 Program running speed optimization
Methods for optimizing program running speed fall into the following major categories.
1.1 General optimization methods
(1) Reduce operation strength

Replace multiplication/division by powers of 2 with left/right shifts: multiplying or dividing by 2 to the power n can usually be done by shifting left or right by n bits. In fact, multiplication by any integer constant can be replaced by shifts and additions. On the ARM7 an addition combined with a shift is a single instruction and takes less time than a multiply instruction. For example, i = i * 5 can be replaced with i = (i << 2) + i.
Replace exponentiation with multiplication: the ARM7 core has a built-in 32x8 multiplier, so exponentiation can be replaced by multiplication to save the cost of calling the power function. For example, i = pow(i, 3.0) can be replaced by i = i * i * i.
Replace the remainder operation with AND: sometimes efficiency can be improved by replacing the remainder operator (%) with a bitwise AND. For example, i = i % 8 can be replaced with i = i & 0x07 (for non-negative i).
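The three substitutions above can be written as small helper functions. This is only a sketch with hypothetical names; unsigned types are used for the shift and mask versions because the equivalences hold exactly for non-negative values.

```c
/* i * 5 as a shift plus an add: (i * 4) + i. */
unsigned int times5(unsigned int i)
{
    return (i << 2) + i;
}

/* i * i * i instead of a call to pow(i, 3.0). */
int cube(int i)
{
    return i * i * i;
}

/* i % 8 as a mask of the low three bits. */
unsigned int mod8(unsigned int i)
{
    return i & 0x07u;
}

/* i / 16 as a right shift by 4 bits. */
unsigned int div16(unsigned int i)
{
    return i >> 4;
}
```

Many compilers already make these substitutions for constant operands when optimization is enabled, so it is worth checking the generated code before rewriting by hand.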
(2) Optimize the loop termination condition
In a loop, the termination condition has a strong effect on efficiency. Given the characteristics of the ARM instruction set, loops should be written with a count-down-to-zero termination condition wherever possible. This allows the compiler to replace the two instructions CMP (compare) and BLE (branch if less than or equal) with a single BNE (branch if not zero) instruction, which both reduces code size and increases running speed.

(3) Use inline functions
ARM C supports the inline keyword. If a function is declared inline, each call to it is replaced by the function body, which completely eliminates the overhead of the function call. The biggest disadvantage of inlining is that code size increases when the function is called from many places.

1.2 Processor-related optimization methods
(1) Keep the pipeline flowing
As the earlier introduction showed, pipeline stalls and blocking affect processor performance, so try to keep the pipeline flowing. Pipeline stalls are unavoidable, but the stall cycles can be used to perform other operations.

The auto-indexing feature of the load/store instructions is designed to make use of these stall cycles. While the pipeline is stalled, the processor's execution unit is occupied, but the arithmetic logic unit (ALU) and the barrel shifter may be idle, so they can be used to apply an offset to the base register for use by subsequent instructions. For example, the instruction LDR R1, [R2], #4 performs the two operations R1 = *R2 and R2 += 4 and is an example of post-indexing, while the instruction LDR R1, [R2, #4]! performs the two operations R1 = *(R2 + 4) and R2 += 4 and is an example of pre-indexing.

Pipeline blocking can also be reduced by loop unrolling. Unrolling a loop reduces the proportion of branch instructions among the instructions executed, which increases the efficiency of the code. This is illustrated below with a memory copy function.

void Memcopy(char *to, char *from, unsigned int nbytes)
{
    while (nbytes--)
        *to++ = *from++;
}

For simplicity, assume that nbytes is a multiple of 16 (handling of any remainder is omitted). The function above performs a test and a branch for every byte processed; the loop body can be unrolled as follows:

void Memcopy(char *to, char *from, unsigned int nbytes)
{
    while (nbytes) {
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        nbytes -= 4;
    }
}

As a result, the number of instructions in the loop body increases while the number of iterations decreases, and the negative effect of the branch instruction is weakened. Taking advantage of the 32-bit word length of the ARM7 processor, the code above can be adjusted further as follows:

void Memcopy(char *to, char *from, unsigned int nbytes)
{
    int *p_to = (int *) to;
    int *p_from = (int *) from;
    while (nbytes) {
        *p_to++ = *p_from++;
        *p_to++ = *p_from++;
        *p_to++ = *p_from++;
        *p_to++ = *p_from++;
        nbytes -= 16;
    }
}
After this optimization, each iteration handles 16 bytes, and the effect of the branch instruction is weakened further. Note, however, that the adjusted code is larger, and that this version also assumes both pointers are word-aligned.
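The unrolled versions above only work when nbytes is a multiple of the unroll size; with any other value the counter wraps past zero and the loop runs on. As a sketch of how the omitted remainder handling might look, the hypothetical function below unrolls by four bytes and copies any leftover bytes one at a time. It stays with byte copies, so it makes no alignment assumption.

```c
/* Byte copy unrolled by 4, with a tail loop for the remaining 0..3 bytes.
   The name memcopy_any is hypothetical. */
void memcopy_any(char *to, const char *from, unsigned int nbytes)
{
    unsigned int blocks = nbytes >> 2;   /* number of 4-byte groups */
    unsigned int tail = nbytes & 3u;     /* leftover bytes */

    while (blocks != 0) {                /* count down to zero */
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        blocks--;
    }
    while (tail != 0) {
        *to++ = *from++;
        tail--;
    }
}
```

The same split into a block count and a tail count can be applied to the word-based version, provided the pointers are checked for word alignment first.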

(2) Use register variables
The CPU accesses registers much faster than it accesses memory, so assigning a variable to a register helps optimize the code and improve running efficiency. Integer, pointer, and floating-point variables can all be assigned to registers, and part or all of a structure can also be held in registers. Assigning registers to variables that are accessed frequently inside a loop body can also improve program efficiency to some extent.

1.3 Instruction-set-related optimization methods
Sometimes a program can be optimized by exploiting the characteristics of the ARM7 instruction set.
(1) Avoid division
The ARM7 instruction set has no division instruction; division is implemented by calling a C library function. A 32-bit division usually takes 20 to 140 clock cycles. As a result, division becomes a bottleneck for program efficiency and should be avoided as much as possible. Some divisions can be replaced by multiplication; for example, if ((x / y) > z) may be rewritten as if (x > (y * z)). Where the accuracy is sufficient and memory is plentiful, a lookup table may also be considered in place of division. When the divisor is a power of 2, use a shift instead of division.

(2) Use conditional execution
An important feature of the ARM instruction set is that every instruction can carry an optional condition code. An instruction with a condition code is executed only when the condition flags in the program status register (PSR) satisfy the specified condition. Conditional execution usually eliminates separate test-and-branch instructions, reducing code size and improving program efficiency.
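A classic illustration of code that suits conditional execution is a subtraction-based greatest common divisor. The sketch below is ordinary C; whether the compiler turns the if/else into conditionally executed subtractions rather than branches depends on the compiler and its options. The function name is hypothetical, and both arguments are assumed to be nonzero.

```c
/* Greatest common divisor by repeated subtraction. The body of the loop
   is one comparison followed by one of two subtractions, a shape that
   maps naturally onto a compare plus two conditional subtract
   instructions. Assumes a != 0 and b != 0. */
unsigned int gcd_subtract(unsigned int a, unsigned int b)
{
    while (a != b) {
        if (a > b) {
            a -= b;
        } else {
            b -= a;
        }
    }
    return a;
}
```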

(3) Use the appropriate variable type
The ARM instruction set supports signed/unsigned 8-bit, 16-bit, and 32-bit integer and floating-point variables. Choosing variable types appropriately not only saves code but also improves its efficiency. Avoid char and short local variables as much as possible, because manipulating 8-bit/16-bit local variables often requires more instructions than manipulating 32-bit variables. Compare the following 3 functions and their assembly code.

int wordinc(int a)            wordinc
{                                 ADD a1, a1, #1
    return a + 1;                 MOV pc, lr
}

short shortinc(short a)       shortinc
{                                 ADD a1, a1, #1
    return a + 1;                 MOV a1, a1, LSL #16
}                                 MOV a1, a1, ASR #16
                                  MOV pc, lr

char charinc(char a)          charinc
{                                 ADD a1, a1, #1
    return a + 1;                 AND a1, a1, #&FF
}                                 MOV pc, lr

As you can see, manipulating 32-bit variables takes fewer instructions than manipulating 8-bit and 16-bit variables.

1.4 Memory-related optimization methods
(1) Use lookup tables instead of calculation
When processor resources are tight and memory is relatively plentiful, storage space can be traded for running speed. For example, if sine or cosine values are needed frequently, they can be computed in advance and stored in memory for later lookup.
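A small sketch of the lookup table idea for sine values. Everything here is an illustrative assumption: the table holds sin(angle) x 1000 for a quarter wave in 15-degree steps, and the hypothetical function sin_milli uses symmetry to cover the full circle. A real application would pick the step size and number format to match its accuracy needs.

```c
/* sin(0), sin(15), ..., sin(90 degrees), scaled by 1000 and rounded. */
static const short sin_quarter[7] = { 0, 259, 500, 707, 866, 966, 1000 };

/* Returns sin(step * 15 degrees) * 1000 using only table lookups.
   The caller must pass step in the range 0..23 (one full circle). */
int sin_milli(unsigned int step)
{
    if (step <= 6) {
        return sin_quarter[step];          /*   0..90  degrees */
    }
    if (step <= 12) {
        return sin_quarter[12 - step];     /* 105..180 degrees */
    }
    if (step <= 18) {
        return -sin_quarter[step - 12];    /* 195..270 degrees */
    }
    return -sin_quarter[24 - step];        /* 285..345 degrees */
}
```

Storing only a quarter wave keeps the table at 7 entries instead of 24, at the cost of a few comparisons per lookup.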

(2) Make full use of on-chip RAM
Some manufacturers integrate a certain amount of RAM into their ARM chips; for example, Atmel's AT91R40807 has 128KB of RAM, and Sharp's LH75400/LH75401 has 32KB of RAM. The processor accesses on-chip RAM faster than external RAM, so run the program from on-chip RAM whenever possible. If the program is too large to fit entirely in on-chip RAM, consider placing the most frequently used data or program segments there to improve efficiency.

1.5 Compiler-related optimization methods
Most compilers support optimizing for program speed and for program size, and some allow the user to choose what to optimize and to what degree. Compared with the methods above, setting compiler options is a simple and effective way to optimize a program.

2 Code size optimization
An important feature of reduced instruction set computers is that the instruction length is fixed, which simplifies instruction decoding but can easily increase code size. To mitigate this, consider the following measures for reducing the amount of program code.

2.1 Use multi-register operation instructions
The multi-register instructions LDM/STM in the ARM instruction set can load/store several registers at once, which is very effective for saving/restoring the state of a group of registers and for copying large blocks of data. For example, saving the contents of registers R4~R12 and R14 to the stack takes 10 STR instructions, whereas a single STMEA R13!, {R4-R12, R14} instruction achieves the same goal and saves a considerable amount of instruction storage. Note, however, that although one LDM/STM instruction can replace several LDR/STR instructions, this does not mean the program runs faster. In fact, the processor executes an LDM/STM instruction by splitting it into several separate LDR/STR operations.

2.2 Arrange variables in a sensible order
The ARM7 processor requires 32-bit/16-bit variables in a program to be word/halfword aligned, which means that a poor ordering of variables can waste storage space. For example, suppose a struct contains four 32-bit int variables i1 ~ i4 and four 8-bit char variables c1 ~ c4. When they are interleaved in the order i1, c1, i2, c2, i3, c3, i4, c4, the alignment of the int variables causes each 8-bit char variable between two int variables to occupy 32 bits of memory, which wastes storage space. To avoid this, group the int and char variables together, in an order such as i1, i2, i3, i4, c1, c2, c3, c4.
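The two orderings described above can be compared directly with sizeof. The struct and function names below are hypothetical; on a typical target with 4-byte, word-aligned int, the interleaved layout occupies 32 bytes and the grouped layout 20 bytes, but the exact figures depend on the compiler and ABI.

```c
#include <stddef.h>

/* Interleaved order: each char is followed by padding so that the next
   int starts on a word boundary. */
struct interleaved {
    int i1; char c1;
    int i2; char c2;
    int i3; char c3;
    int i4; char c4;
};

/* Grouped order: the four chars share a single word at the end. */
struct grouped {
    int i1, i2, i3, i4;
    char c1, c2, c3, c4;
};

/* Number of bytes saved per object by grouping the members. */
size_t bytes_saved_by_grouping(void)
{
    return sizeof(struct interleaved) - sizeof(struct grouped);
}
```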

2.3 Use Thumb instructions
To reduce code size effectively, ARM developed the 16-bit Thumb instruction set. Thumb is an extension of the ARM architecture. The Thumb instruction set consists of the most commonly used 32-bit ARM instructions compressed into 16-bit-wide instructions. At execution time, the 16-bit instructions are transparently decompressed into 32-bit ARM instructions without performance loss, and switching between Thumb state and ARM state has zero overhead. Thumb code takes up to 35% less memory than the equivalent 32-bit ARM code.
This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.