Workflow:
The workflow is generally divided into three stages.
Stage 1: Implement the required functionality in C. In real DSP applications many algorithms are very complex. Writing them directly in assembly gives highly efficient code but is very difficult, so they are generally implemented in C first, then compiled and run. The running time is measured with the profile clock tool in the C64x development environment. If it does not meet the requirements, proceed to stage 2.
Stage 2: C-level optimization. Select the optimization options provided by the C64x development environment and apply other coding techniques to optimize the C code. If the efficiency requirement still cannot be met, proceed to stage 3.
Stage 3: Assembly-level optimization. Extract the inefficient parts of the C program from the previous stage, rewrite them in linear assembly, and optimize them with the assembly optimizer. Linear assembly is written without regard to the C64x pipeline structure or register allocation. The assembly optimizer then allocates registers, optimizes loops, and converts the code into fast, parallel, software-pipelined assembly.
Not all three stages have to be carried out. Once the expected performance is reached at some stage, the following stages are unnecessary.
1) Select the optimization options provided by the C compiler
The compiler provides automatic optimization options at several levels and in several categories, as shown below:
● -o: enables software pipelining and other optimizations
● -pm: enables program-level optimization
● -mt: tells the compiler that the program contains no memory aliasing, so that it can optimize the code further
● -mg: allows optimized code to be profiled
● -ms: ensures that no redundant loops are generated, to reduce code size
● -mh: allows speculative execution
● -mx: allows the compiler to retry software pipelining of a loop; it tries several schemes based on the loop trip count and selects the best one
Select the options appropriate to the program being compiled to optimize the source program.
2) Reduce memory dependences
To maximize efficiency, the C64x compiler schedules as many instructions as possible for parallel execution. To do so it must know the dependences between instructions, because only independent instructions can execute in parallel. When the compiler cannot determine whether two instructions are dependent, it assumes that they are, and they cannot be executed in parallel. A common remedy is to qualify an object with the keyword const, which states that the variable, or the memory it refers to, is not modified. Adding const to the code can therefore remove dependences between instructions. For example, consider the following program:
void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    unsigned int i;
    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
As shown in Figure 2 (a), writing to sum may affect the addresses pointed to by in1 and in2, so the reads of in1 and in2 can only be performed after the write to sum has completed, which reduces pipeline efficiency. To help the compiler determine the memory dependences, use the const keyword to qualify the objects. The source code above then becomes the following optimized source code:
void vecsum(short *sum, const short *in1, const short *in2, unsigned int N)
{
    unsigned int i;
    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
As shown in Figure 2 (b), because the keyword const is used, the dependence paths between instructions are eliminated. The compiler can now determine the dependences between memory operations and find a better instruction schedule.
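In addition to const, C99 offers the restrict qualifier, which states directly that the memory reached through one pointer is not reached through another. The following is a minimal sketch, assuming a compiler that accepts restrict; the name vecsum_restrict is hypothetical and not part of the original program.

```c
/* Hypothetical variant of vecsum. restrict promises the compiler that the
 * three arrays do not overlap, so loads from in1/in2 need not wait for
 * earlier stores to sum. The caller must honor that promise. */
void vecsum_restrict(short *restrict sum,
                     const short *restrict in1,
                     const short *restrict in2,
                     unsigned int n)
{
    unsigned int i;
    for (i = 0; i < n; i++)
        sum[i] = (short)(in1[i] + in2[i]);
}
```

Calling it with overlapping buffers would break the promise and give undefined behavior, so this form only suits call sites where the buffers are known to be distinct.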
3) Use intrinsics
Intrinsics are special functions provided by the C64x compiler. They map one-to-one to assembly instructions and are intended for quickly optimizing C source code. An intrinsic is called in the source program exactly like an ordinary function, except that its name begins with an underscore as a special identifier. Intrinsics are useful when the function of an assembly instruction is not easy to express in C. For example, fixed-point arithmetic often needs the number of redundant sign bits of a source operand. Written in C, this requires the following code:
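unsigned int norm(int src1)
{
    unsigned int x = (unsigned int)src1;
    unsigned int sign, result = 0;

    sign = x & 0x80000000u;          /* isolate the sign bit */
    while (result < 31)              /* at most 31 redundant sign bits */
    {
        x <<= 1;                     /* move the next bit into the sign position */
        if (sign)
        {
            if (x & 0x80000000u)     /* next bit repeats a negative sign */
                result += 1;
            else
                return result;
        }
        else
        {
            if (x & 0x80000000u)     /* next bit differs from a positive sign */
                return result;
            else
                result += 1;
        }
    }
    return result;
}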
This source code is lengthy and contains many logical operations and conditional branches, so it runs inefficiently. With the intrinsic, result = _norm(src1) does the same job, which shortens the code and improves efficiency. Therefore, for complex functions that would need a large amount of C code to express, use the C64x intrinsics wherever possible.
4) Process short data as int
The C64x DSP supports packed dual 16-bit operations: the chip can perform two 16-bit multiplications, additions, subtractions, comparisons, or shifts in one cycle. When operating on a continuous stream of short data, convert it to a stream of int data so that two 16-bit values are read into one 32-bit register at a time, then process them with intrinsics (such as _sub2). This makes full use of the dual 16-bit capability: two 16-bit operations are performed at a time, which can double the speed.
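To make the packed idea concrete, here is a minimal portable sketch of what a dual 16-bit subtract computes. The helpers sub2_emulated and pack2 are hypothetical stand-ins written in plain C for illustration; on the C64x a single intrinsic call such as _sub2 would replace the emulation.

```c
#include <stdint.h>

/* Portable stand-in for a packed dual 16-bit subtract: the low and high
 * halves are subtracted independently, so a borrow never crosses from
 * the low half into the high half. */
static uint32_t sub2_emulated(uint32_t a, uint32_t b)
{
    uint32_t lo = (a - b) & 0x0000FFFFu;            /* low halves  */
    uint32_t hi = ((a >> 16) - (b >> 16)) << 16;    /* high halves */
    return hi | lo;
}

/* Pack two 16-bit samples into one 32-bit word (hi in bits 31..16). */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

For example, subtracting the pair (30, 7) from the pair (100, 5) yields 70 in the upper half and -2 (FFFEh) in the lower half, whereas an ordinary 32-bit subtraction would let the borrow from the lower half corrupt the upper one.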
5) Use as few function calls as possible
Calling a function requires pushing the PC and some registers onto the stack and popping them again when the function returns, which adds unnecessary operations. It is therefore best to replace small functions with inline functions or to write them directly into the main function. Functions that are called from only a few places can also be written directly into the main function. This removes unnecessary operations and increases speed, but it often increases program length, so it trades space for time.
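A minimal sketch of the inlining approach, assuming a C99 compiler; clip_to_byte and clip_block are hypothetical names chosen for illustration.

```c
/* A small helper that is called very often. Marking it static inline lets
 * the compiler paste its body into the caller and skip the call/return
 * overhead (the compiler may still decide not to inline). */
static inline int clip_to_byte(int x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return x;
}

/* Hypothetical caller: once the helper is inlined, the loop body
 * contains no function call. */
void clip_block(const int *in, unsigned char *out, unsigned int n)
{
    unsigned int i;
    for (i = 0; i < n; i++)
        out[i] = (unsigned char)clip_to_byte(in[i]);
}
```

The helper stays readable as a separate function in the source, while the generated code can avoid the stack traffic described above.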
6) Use logical operations instead of multiplication and division
On the DSP, multiplication and division instructions take far longer to execute than logical shifts, division especially. During design, adjust the code where the situation allows and replace multiplications and divisions with shifts to shorten the running time.
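A minimal sketch of the substitution for powers of two and for a constant that decomposes into powers of two. The function names are hypothetical. Unsigned types are used on purpose: for negative signed values a right shift does not round the same way as division, so the replacement is only exact for non-negative data.

```c
#include <stdint.h>

/* Multiply by 8 with a left shift (valid while the result fits in 32 bits). */
static uint32_t times8(uint32_t x)
{
    return x << 3;
}

/* Divide by 16 with a right shift; exact for unsigned values. */
static uint32_t div16(uint32_t x)
{
    return x >> 4;
}

/* Multiply by 10 as 8x + 2x: two shifts and one add. */
static uint32_t times10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

Optimizing compilers often perform these rewrites themselves for constant operands, so it is worth checking the generated assembly before changing the source by hand.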
7) Use software pipelining
Software pipelining schedules the instructions of a loop so that several iterations execute in parallel. When compiling, select the -o2 or -o3 option and the compiler will software-pipeline the loops of the program wherever it can. DSP algorithms contain a large number of loops, so software pipelining can greatly improve running speed. However, software pipelining has the following restrictions:
● The loop cannot contain function calls, but it can contain intrinsics.
● The loop counter should count down.
● The loop cannot contain break, if statements cannot be nested, and conditional code should be as simple as possible.
● The loop body must not modify the loop counter.
● The loop body cannot be too long. Because the number of registers is limited (32 in each of the two register files), a long loop should be split into several loops.
When applying software pipelining, break complicated loops down into simple small loops so that the registers do not run out. Loops that are too simple should be unrolled appropriately, to increase the amount of code in the loop body and the number of instructions that can overlap in the pipeline.
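The restrictions above can be illustrated with a minimal sketch of a loop written with them in mind. The function dotprod is a hypothetical example, and it assumes that n is even and non-negative. Whether a given compiler actually pipelines it depends on the options used.

```c
/* Dot product shaped for software pipelining: the counter counts down and
 * is changed only in the loop header, the body has no calls, no break and
 * no conditionals, and the loop is unrolled by two so that each iteration
 * has enough independent work. Assumes n is even and n >= 0. */
int dotprod(const short *a, const short *b, int n)
{
    int sum0 = 0, sum1 = 0;
    int i;
    for (i = n; i > 0; i -= 2) {
        sum0 += a[0] * b[0];    /* even-indexed products */
        sum1 += a[1] * b[1];    /* odd-indexed products  */
        a += 2;
        b += 2;
    }
    return sum0 + sum1;
}
```

Keeping two separate accumulators removes the dependence between the two multiply-accumulate steps inside one iteration, which gives the scheduler more freedom.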
8) Reorder instructions
Some instructions in a program have no strict requirement on execution order, so their positions can be adjusted. Moving such instructions and interleaving them with other instructions reduces the dependences between instructions and increases parallelism at run time.
This applies especially to loops. When loop bodies are small, the code of several loops can be written in one loop body and merged into a single loop, which reduces the dependences between the instructions in the loop and increases parallelism. Take care, however, not to make the loop so complex that it can no longer be software-pipelined.
Because the assembly code that the compiler generates from C is not the most efficient possible, the program could not play back in real time. To reach real-time playback speed, we had to optimize it. The c6x compiler provides optimization options, such as the -o3 parameter at compile time, which make it analyze whether the program can be improved. Before generating the assembly language file, the compiler makes repeated passes over the C program and rearranges its loops, generating a more efficient loop kernel and rescheduling the program to run faster.
Method 1: Change floating-point operations to fixed-point operations
The c6x DSP board does not support floating-point operations, but our original program code used floating point, so we had to change it to fixed point. The modified program runs much faster. We use the Q format to represent fractional numbers. The related principles are described below.
A fixed-point DSP represents the fractional part of a number with a fixed binary point, which limits the range it can represent. To place the binary point for different ranges, we use the Q format. Different Q formats have a different number of fractional bits and therefore a different integer range. Table 2 shows the format of Q15 numbers. Each bit after the binary point has half the weight of the bit before it, and the MSB (most significant bit) is the sign bit. As table 2 shows, when the sign bit is 0 and the remaining bits are 1, we get the largest positive number (7FFFh); when the sign bit is 1 and the remaining bits are 0, we get the most negative number (8000h). The Q15 format therefore ranges from -1 to 0.9999694 (≈ 1). We can increase the range of the integer part by moving the binary point to the right: as shown in table 3, the range of the Q14 format increases to -2.0 to 1.9999389 (≈ 2), but the larger range sacrifices precision.
Method 2: Create lookup tables
During decoding, the original program not only read the AAC file but also computed values at run time, for example sin, cos, and exp operations. To speed up execution, the results of these operations are precomputed, stored in a table, and built into the program, so that no extra computation is needed at run time.
Method 3: Reduce the length of the program
1. Remove the debug functions
While the original program was being debugged, many error-detection parts were added. Now that debugging is complete and no errors occur, these parts can be removed. This shortens the program and also reduces the number of clock cycles used during execution, which speeds it up.
2. Remove the timing (clock) functions
The original program could count the clock cycles needed to execute it. These parts can also be removed. If clock cycles need to be counted, the c6x tool software can be used instead, and it is more powerful.
Method 4: Reduce I/O
Originally, decoding read part of the AAC file, decoded it, then read the next part and decoded that. Because file reads between the c6x board and the PC are quite slow, reading took most of the time. I therefore changed the program to read the whole AAC file into c6x memory first and then decode it. Alternatively, the AAC data can be built into a table, keeping it to about 1 MB to avoid running out of memory on the DSP board.
Method 5: Reduce subroutine calls
Calling a subroutine requires first saving the contents of the registers on the stack, and on return the original register contents must be restored from the stack. Some subroutines are short and frequently called: they finish very quickly but waste time on stack accesses. We therefore write these short subroutines directly into the main program to reduce the number of clock cycles.
Method 6: Write assembly language
Although the assembly code that the compiler generates from C executes correctly, it is not the most efficient way to write the code. To increase efficiency, some parts, for example functions that are called many times and whose code is not long, must be replaced with hand-written assembly.
Method 7: Use parallel processing
The C6x is a powerful processor. Its CPU has eight functional units that can execute different instructions, so it can process up to eight instructions at the same time. If we make use of this parallelism, we can greatly shorten the execution time of the program and decode as efficiently as possible.
Finally, in my experience level 3 optimization (-o3) is not very efficient. There are also techniques such as using one 32-bit load to read two adjacent 16-bit values; for details, refer to the C optimization manual. Even so, the efficiency is not high: although TI's marketing says it can reach 80%, I never got that when I tried it myself (65%). If you want higher efficiency, the only option is assembly. Also, take a look at how your C program was compiled. If it contains many interrupts, the C6000 has practically no advantage. In addition, the profiler's figures are not accurate; they are much larger than the actual values. The DSP is also particularly slow during initialization, so the total time should not be compared with that of a PC. If you do compare, compare only the core part.
Profile debugging tool
The c6x provides a profile interface. Figure 9 shows several important windows. The window in the upper left corner shows the C code we wrote, so that we can see which step has been reached. The window in the upper right corner shows the assembly code that the c6x compiler generated, where we can also see which step has been reached. The window in the lower left corner is the command window, where we enter commands and messages are displayed. The profile window in the middle is the most important window in profile mode. The following table explains its fields:
Table 5: profile parameters [8]
Field       Meaning
Count       Number of calls
Inclusive   Total number of clock cycles executed, including subprograms
Incl-Max    Maximum number of clock cycles for a single execution, including subprograms
Exclusive   Total number of clock cycles executed, excluding subprograms
Excl-Max    Maximum number of clock cycles for a single execution, excluding subprograms
Profile mode can be used to analyze the number of calls to each function in the program, the number of clock cycles each function takes, and so on. From the results of this analysis we can see which functions consume the most clock cycles and optimize those.
Optimization of assembly code
After optimizing the C code, use the profile clock tool to find the inefficient parts and rewrite them in linear assembly. The assembly optimizer then takes the linear assembly code as input and performs the following tasks:
● Finds instructions that can be executed in parallel.
● Handles pipeline labels during software pipelining.
● Allocates registers.
● Allocates functional units.
The assembly optimizer provided by TI achieves high efficiency and generally meets the performance requirements.
Optimization Problems
Optimization always involves changing the program, so some problems occur frequently.
1) Verification of optimization results
After optimization it is often unclear whether the program still runs correctly, so it must be verified. Verification generally uses test sequences. A test sequence is a set of special data derived for a particular algorithm that accurately reflects the characteristics of that algorithm. Each group of data in a test sequence includes input data and output data. The computed result is compared with the output data to determine whether the program is correct. Common algorithms usually come with test sequences, but some do not. In that case, construct a test sequence for verification based on the characteristics of the algorithm. When constructing one, it is advisable to prepare several groups of sequences, each of reasonable length, so that the verification is more reliable.
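A minimal sketch of such a check, assuming a small vector-sum routine as the code under test. The struct test_vector, the routine vecsum_opt, and the function run_test_sequence are hypothetical names, and the vector length is fixed at 4 only to keep the example short.

```c
#include <string.h>

#define VEC_LEN 4

/* One entry of a test sequence: input data plus the expected output. */
struct test_vector {
    short in1[VEC_LEN];
    short in2[VEC_LEN];
    short expected[VEC_LEN];
};

/* Routine under test (stands in for the optimized code). */
static void vecsum_opt(short *sum, const short *in1, const short *in2,
                       unsigned int n)
{
    unsigned int i;
    for (i = 0; i < n; i++)
        sum[i] = (short)(in1[i] + in2[i]);
}

/* Runs every vector and returns how many produced a wrong output. */
static int run_test_sequence(const struct test_vector *tv, int count)
{
    short out[VEC_LEN];
    int failures = 0;
    int k;
    for (k = 0; k < count; k++) {
        vecsum_opt(out, tv[k].in1, tv[k].in2, VEC_LEN);
        if (memcmp(out, tv[k].expected, sizeof out) != 0)
            failures++;
    }
    return failures;
}
```

Running the same sequence after every optimization step makes it easy to see which change introduced an error.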
2) Memory overruns
The internal memory of C64x series DSPs is 1 MB, and it is shared by the program, the data, and the CPU's second-level cache. When a program runs abnormally, the cause is therefore likely to be a memory overrun. In program design, use pointers as sparingly as possible and pay attention to bounds checking.
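A minimal sketch of a bounds check in front of a buffer write. The buffer, its length BUF_LEN, and the function store_sample are hypothetical and only illustrate the idea.

```c
#include <stddef.h>

#define BUF_LEN 16

static short buffer[BUF_LEN];

/* Store a sample only when the index lies inside the buffer.
 * Returns 0 on success, -1 when the write would overrun the buffer. */
static int store_sample(size_t index, short value)
{
    if (index >= BUF_LEN)
        return -1;
    buffer[index] = value;
    return 0;
}
```

Checks like this cost a few cycles, so one option is to keep them in debug builds and remove them from the innermost loops once the index ranges have been verified.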
Some programming methods
In programming, everything is designed to meet actual requirements. In a real design, besides optimizing the code, you can also exploit the characteristics of the DSP to improve running performance and meet the design requirements.
1) Put programs and frequently used data into the on-chip RAM
The on-chip RAM runs at the same clock frequency as the CPU, so its performance is much higher than that of off-chip RAM. Placing the program on chip can therefore greatly improve running speed, and placing frequently used data on chip also saves processing time.
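A minimal sketch of how a frequently used table might be directed to on-chip memory with the TI compiler's DATA_SECTION pragma. The table fir_coeffs, the function coeff_sum, and the section name ".onchip_data" are assumptions made for illustration; the linker command file would have to map that section to internal RAM. The pragma is guarded so that the code also builds with other compilers.

```c
#ifdef _TMS320C6X
/* TI compiler only: place the table in a named section. The section
 * name is an assumption; the linker command file must map it to
 * on-chip RAM for the placement to take effect. */
#pragma DATA_SECTION(fir_coeffs, ".onchip_data")
#endif
short fir_coeffs[8] = { 1, 2, 3, 4, 4, 3, 2, 1 };

/* The table is used like any other array regardless of where it lives. */
int coeff_sum(void)
{
    int i, s = 0;
    for (i = 0; i < 8; i++)
        s += fir_coeffs[i];
    return s;
}
```

The source code stays the same whether the data ends up on chip or off chip; only the pragma and the linker command file decide the placement.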
2) Move data with DMA
The C64x has 1 MB of on-chip memory, which may not be enough for some large image processing algorithms. DMA is therefore often used: moving the required data onto the chip and moving data that is no longer needed off the chip can greatly improve running speed.
3) Use of the cache
Enlarging the cache can significantly improve performance. However, on C64x series DSPs the cache shares the on-chip RAM with the program and data, so enlarging the cache reduces the on-chip space that is actually available. Keep this in mind during design.