Basic methods for optimizing code on the TI C64x DSP

Source: Internet
Author: User

I. The optimization process is generally divided into three stages

Stage one: implement the required functionality directly in C. In real DSP applications many algorithms are written directly in assembly; the resulting code is very efficient, but it is also very hard to write. The usual approach is therefore to implement the algorithm in C first, compile and run it, and measure its run time with the profile and clock tools of the C64x development environment. If the result does not meet the requirements, proceed to stage two.
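The profile and clock tools belong to the development environment. A comparable measurement can also be sketched in code, assuming the standard C clock() function is available in the target's run-time library; work() and time_work() are hypothetical names, with work() standing in for the algorithm under test:

```c
#include <time.h>

/* Hypothetical stand-in for the algorithm being measured. */
static long work(int n)
{
    long acc = 0;
    int i;

    for (i = 0; i < n; i++)
        acc += (long)i * 3;
    return acc;
}

/* Returns the clock ticks spent in one call of work(n).
   The result of work() is stored through out so that it is actually used. */
static long time_work(int n, long *out)
{
    clock_t start = clock();

    *out = work(n);
    return (long)(clock() - start);
}
```

The tick count is only meaningful relative to other measurements taken the same way, for example before and after an optimization step.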

Stage two: C-level optimization. Select the optimization options provided by the C64x development environment and apply other techniques to optimize the C code. If the code still does not meet the efficiency requirements, proceed to stage three.

Stage three: assembly-level optimization. Extract the parts of the C program from the previous stage that are still inefficient, rewrite them in linear assembly, and run them through the assembly optimizer. The assembly optimizer lets the developer write linear assembly without considering the C64x pipeline structure or allocating registers; it then allocates registers, optimizes loops, and converts the linear assembly into highly parallel assembly code that makes use of software pipelining. Not all three stages are mandatory: once the desired performance is reached at some stage, the following stages are unnecessary.

II. Choose the optimization options provided by the C compiler

-o: enables software pipelining and other optimizations

-pm: enables program-level optimization

-mt: tells the compiler to assume that the program contains no memory aliasing, so that it can optimize the code further

-mg: allows profiling of optimized code

-ms: ensures that no redundant loops are generated, which reduces code size

-mh: allows speculative execution

-mx: enables software pipelining retry; the compiler tries several schedules for a loop, based on its trip count, and chooses the best one

III. Reduce memory dependencies to maximize instruction efficiency

The C64x compiler schedules instructions for parallel execution wherever it can. To do so it must know the dependencies between instructions, because only independent instructions can execute in parallel. When the compiler cannot determine whether two instructions depend on each other, it assumes that they do and does not run them in parallel. The const keyword is commonly used to qualify the objects involved: it indicates that a variable, or the memory a pointer refers to, is not modified. Adding const to the code can therefore remove dependencies between instructions.

For example, the following program:

void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    int i;
    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}

A write to sum may modify the memory that in1 and in2 point to, so each read of in1 and in2 must wait until the preceding write to sum has completed, which reduces pipeline efficiency. To help the compiler determine the memory dependencies, qualify the inputs with the const keyword. The source above then becomes:

void vecsum(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;
    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}

Using const removes the dependency path between the instructions: the compiler can tell which memory operations are independent and find a better instruction schedule.
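To see why the compiler has to be conservative with the original, unqualified version, consider a small sketch in which the output really does overlap an input. The helper overlap_demo is hypothetical and not part of the original program. Each write to sum[i] is read back as in1[i + 1], so moving the loads ahead of the stores would change the result:

```c
/* The original, unqualified version of the loop. */
static void vecsum(short *sum, short *in1, short *in2, unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; i++)
        sum[i] = (short)(in1[i] + in2[i]);
}

/* Hypothetical helper: calls vecsum with sum overlapping in1 by one element. */
static void overlap_demo(short buf[5])
{
    short ones[4] = {1, 1, 1, 1};

    /* sum = buf + 1 and in1 = buf: the store to sum[i] feeds the load of in1[i + 1]. */
    vecsum(buf + 1, buf, ones, 4);
}
```

Starting from {1, 1, 1, 1, 1}, the buffer ends up as {1, 2, 3, 4, 5}. Had all four loads been performed before the stores, the result would have been {1, 2, 2, 2, 2}, which is why the compiler must serialize the accesses unless it knows the buffers are independent.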

IV. Use intrinsics

Intrinsics are special functions provided by the C64x compiler that map one-to-one onto inline assembly instructions; they exist to make it quick to optimize C source code. An intrinsic is called like an ordinary function, except that its name starts with an underscore as a special marker. When the function of an assembly instruction is awkward to express in C, use the intrinsic instead. For example, fixed-point arithmetic often needs the number of redundant sign bits of a source operand. Written in C, this takes lengthy source code with many logical operations and conditional jumps, and it runs inefficiently. With the intrinsic, result = _norm(src1), the code is shorter and runs faster. Complex operations that would otherwise need a large amount of C code should therefore be expressed with C64x intrinsics.
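For comparison, a portable C sketch of what the _norm call computes might look like the following. The name norm_ref is hypothetical, and the code assumes the usual definition of the count on a 32-bit operand: the number of bits directly below the sign bit that are equal to the sign bit. On the C64x the whole function would be replaced by the single _norm call.

```c
#include <stdint.h>

/* Portable reference: number of redundant sign bits in a 32-bit value. */
static int norm_ref(int32_t x)
{
    uint32_t u = (uint32_t)x;
    uint32_t mask = 0x40000000u;   /* start at bit 30, just below the sign bit */
    int count = 0;

    if (x < 0)
        u = ~u;                    /* fold negative values onto the leading-zero case */

    while (mask != 0 && (u & mask) == 0) {
        count++;
        mask >>= 1;
    }
    return count;
}
```

The loop and the conditional are exactly the kind of logic and branching that the single instruction avoids.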

V. Process short data as int

The C64x DSP has dual 16-bit extensions: the chip can perform two 16-bit multiplications, additions, subtractions, comparisons, shifts, and similar operations in a single cycle. When operating on a continuous stream of short data, treat it as a stream of int data so that two 16-bit values are read into one 32-bit register, then process them with intrinsics such as _sub2. This makes full use of the dual 16-bit extensions: two 16-bit values are processed at a time, which roughly doubles the speed.
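The idea of a packed 16-bit subtraction can be emulated in portable C as a rough sketch. The names pack2 and sub2_ref are hypothetical, and the emulation assumes that the two halves are subtracted independently with no borrow between them; on the C64x the body of sub2_ref would be a single intrinsic such as _sub2.

```c
#include <stdint.h>

/* Hypothetical helper: packs two 16-bit values into one 32-bit word. */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Portable emulation: subtracts the upper and lower halves independently. */
static uint32_t sub2_ref(uint32_t a, uint32_t b)
{
    uint32_t lo = (a - b) & 0x0000FFFFu;          /* low 16 bits depend only on the low halves */
    uint32_t hi = ((a >> 16) - (b >> 16)) << 16;  /* no borrow from the low half */

    return hi | lo;
}
```

For example, subtracting the pair (1, 4) from the pair (5, 3) yields the pair (4, -1), stored as 0x0004FFFF.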

VI. Make as few function calls as possible

When a function is called, the PC and some registers are pushed onto the stack, and when it returns they are restored, which adds unnecessary work. Small functions are therefore best replaced by a suitable intrinsic or written directly into the calling function; functions that are called only a few times can also be written directly into the caller. This reduces unnecessary operations and improves speed, but it tends to increase program length, so it trades space for time.
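As a small illustration, assume a hypothetical helper clip() that limits a value to a range. The first loop pays for a call on every element; the second has the helper's body written directly into the loop:

```c
/* Hypothetical small helper, called once per element. */
static int clip(int x, int lo, int hi)
{
    if (x < lo) x = lo;
    if (x > hi) x = hi;
    return x;
}

/* Version with a function call inside the loop. */
static void clip_all_calls(short *buf, int n, int lo, int hi)
{
    int i;

    for (i = 0; i < n; i++)
        buf[i] = (short)clip(buf[i], lo, hi);
}

/* Version with the helper's body written directly into the caller. */
static void clip_all_inlined(short *buf, int n, int lo, int hi)
{
    int i;

    for (i = 0; i < n; i++) {
        int v = buf[i];
        if (v < lo) v = lo;
        if (v > hi) v = hi;
        buf[i] = (short)v;
    }
}
```

Both versions produce the same output; the second is longer in source and object code but contains no call inside the loop.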

VII. Use logical operations instead of multiplication operations

On the DSP, a multiplication instruction takes much longer to execute than a logical shift, and a division takes longer still. Where the situation allows, adjust the design so that multiplications and divisions are replaced by logical shifts, which shortens instruction execution time.
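A minimal sketch of the substitution, using hypothetical function names and unsigned operands (for negative signed values a right shift does not round the same way as division, so the replacement is only shown for unsigned data):

```c
/* x * 8 written as a shift: 8 is 2 to the power 3. */
static unsigned int times8(unsigned int x)
{
    return x << 3;
}

/* x / 4 written as a shift: 4 is 2 to the power 2. */
static unsigned int div4(unsigned int x)
{
    return x >> 2;
}

/* x * 10 decomposed into two shifts and an add: 10 = 8 + 2. */
static unsigned int times10(unsigned int x)
{
    return (x << 3) + (x << 1);
}
```

The substitution applies directly only when the constant is a power of two, or a sum of a few powers of two as in times10.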

VIII. Use software pipelining

Software pipelining schedules the instructions of a loop so that several iterations execute in parallel. When compiling, choose the compiler's -o2 or -o3 option, and the compiler will software-pipeline the program's loops wherever it can. DSP algorithms contain many loops, so making full use of software pipelining can greatly improve the running speed of a program. Software pipelining has several restrictions, however:

The loop must not contain function calls, although it may contain intrinsics.

The loop counter should count down.

The loop must not contain break, if statements must not be nested, and conditions should be as simple as possible.

The loop must not contain code that modifies the loop counter.

The loop body must not be too long, because the number of registers is limited (32 per register file on the C64x); a long loop should be split into several loops.

When applying software pipelining, split complex loops into simple, small loops so that they do not run out of registers. Loops that are too simple should be unrolled to increase the amount of code and the number of instructions per iteration in the pipeline.
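A loop that respects the restrictions above might be sketched as follows. The function names are hypothetical; dotp uses a down-counting loop with no calls, no break, and no changes to the counter inside the body, and dotp_unrolled2 shows the same loop unrolled by two under the assumption that n is even:

```c
/* Down-counting loop with a simple body. */
static int dotp(const short *a, const short *b, int n)
{
    int sum = 0;
    int i;

    for (i = n; i > 0; i--)
        sum += *a++ * *b++;
    return sum;
}

/* The same loop unrolled by two; assumes n is even. */
static int dotp_unrolled2(const short *a, const short *b, int n)
{
    int sum0 = 0;
    int sum1 = 0;
    int i;

    for (i = n >> 1; i > 0; i--) {
        sum0 += a[0] * b[0];   /* two independent accumulators, */
        sum1 += a[1] * b[1];   /* so the two products do not depend on each other */
        a += 2;
        b += 2;
    }
    return sum0 + sum1;
}
```

Whether the unrolled form is actually faster depends on the compiler and its options, so it is worth measuring both.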

IX. Reorder instructions

Some instructions in a program have no strict ordering requirement, so their positions can be adjusted and they can be interleaved with other instructions, which reduces dependencies between instructions and increases run-time parallelism. In loops in particular, when the loop body is small, the code of several loops can be written into one loop body and merged into a single loop, reducing dependencies within the loop and increasing instruction-level parallelism. Take care not to make the loop so complex that it can no longer be software-pipelined.
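Merging two small loops into one can be sketched as follows. The function names are hypothetical, and the sketch assumes that the output arrays s and d do not overlap the inputs a and b:

```c
/* Two separate small loops over the same data. */
static void sum_then_diff(const short *a, const short *b,
                          short *s, short *d, int n)
{
    int i;

    for (i = 0; i < n; i++)
        s[i] = (short)(a[i] + b[i]);
    for (i = 0; i < n; i++)
        d[i] = (short)(a[i] - b[i]);
}

/* The same work merged into one loop body: the add and the subtract
   are independent of each other and can be scheduled side by side. */
static void sum_and_diff(const short *a, const short *b,
                         short *s, short *d, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        s[i] = (short)(a[i] + b[i]);
        d[i] = (short)(a[i] - b[i]);
    }
}
```

The merged loop also loads each input element once instead of twice.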
