We selected the JM 6.1e reference software published by ITU-T as our optimization target; the goal is a real-time codec for the Baseline profile. However, the JM 6.1e code is complex and redundant, so it first needs major adjustments on the PC side: removing redundant code, standardizing the program structure, adjusting and redefining global and local variables, and reworking data structures.
1. Code migration
Code porting means moving the program that runs on the PC to the DSP so that it can run there at a basic level. The main issues to consider are memory allocation and syntax rules.
2. DSP-side code optimization
Porting the H.264 code makes the codec algorithm run on the DSP, but its efficiency is very low, because all of the code is written in C and does not fully exploit the DSP. We must therefore optimize it further according to the characteristics of the DSP in order to process video in real time with the H.264 codec algorithm.
Code optimization is divided into three levels: project-level optimization, algorithm-level optimization, and command-level optimization.
* Project-level optimization
This is optimization of the project as a whole. The main means are as follows. First, use the optimization features of the CCS compiler by selecting and configuring optimization options, such as enabling the -o3 option. Second, adjust the program structure and rewrite statements that are poorly suited to DSP execution, to improve parallelism. Finally, allocate memory rationally. Because DSP resources are limited, we place frequently used data and code, such as global variables and programs, in fast on-chip memory, and place data that occupies a lot of space, such as frame stores, in off-chip memory.
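The memory split described above can be sketched in C as follows. This is a minimal illustration, not the project's actual layout: `DATA_SECTION` is a pragma of the TI compiler, but the section names `.ext_frames` and `.int_fast`, the buffer names, and the frame count are assumptions made for the example, and the sections would still have to be mapped to external and internal memory in the project's linker command file. The pragmas are guarded so that other compilers simply ignore them.

```c
#include <assert.h>
#include <stddef.h>

#define QCIF_W 176
#define QCIF_H 144
/* One 4:2:0 frame: a luma plane plus two quarter-size chroma planes. */
#define FRAME_BYTES (QCIF_W * QCIF_H * 3 / 2)
#define NUM_FRAMES 2            /* assumption: current frame + one reference */

/* Hypothetical section names; they must match the linker command file. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(frame_store, ".ext_frames")
#pragma DATA_SECTION(mb_work, ".int_fast")
#endif
unsigned char frame_store[NUM_FRAMES][FRAME_BYTES]; /* large: external memory */
short mb_work[16 * 16];                             /* hot: internal memory   */

/* Return the start of frame idx inside the frame store. */
unsigned char *frame_ptr(int idx)
{
    assert(idx >= 0 && idx < NUM_FRAMES);
    return frame_store[idx];
}
```

Even at QCIF resolution the two frames take about 74 KB, which is why bulky frame data is the natural candidate for external memory while small, frequently touched working buffers stay internal.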
* Algorithm-level optimization
This exploits the characteristics of H.264 itself: fast, efficient algorithms extract savings at the algorithm level to increase running speed. This part of the work focuses on the encoder. In video encoding, motion estimation carries the largest share of the computation. Research shows that in H.264, motion estimation accounts for about 70% of the total computation with a single reference frame, and up to 90% with five reference frames. An effective, fast motion estimation algorithm is therefore necessary. Through research, we propose a motion estimation algorithm based on prediction and early termination. The main idea is to predict the motion vector of the current block from neighboring blocks and to set an adaptive threshold so that the search stops early. With a search window of 32, our algorithm searches about 3-4 points per window on average, against 4225 points for full search, so the search is more than 1000 times faster. The advantage over some classic fast algorithms is also clear.
In H.264, sub-pixel motion estimation uses full search, which requires 16 search points at H.264's sub-pixel precision. We proposed our own fast sub-pixel search algorithm, which averages 7 search points and saves more than 60% of the computation. The new algorithm speeds up encoding significantly while keeping quality good: the SNR loss is less than 0.06 dB and the bit rate increases by about 2%, which is negligible for a motion estimation algorithm.
In addition, evaluating all 7 block-size matching modes in inter-frame coding and all 13 intra-frame prediction modes is too complex and computationally expensive, so we proposed an adaptive mode selection algorithm that finds a near-optimal mode without computing every mode.
These algorithms greatly increase the code running speed and reach a good compromise between speed and quality.
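The idea of predicting from neighboring blocks and stopping early can be sketched in C as below. This is only an illustration under stated assumptions, not the algorithm proposed in this article: the median-of-three predictor, the four-neighbor refinement step, and the names `MV`, `predictive_search`, `sad16`, `median3`, and `inside` are all hypothetical, and the threshold is passed in as a plain parameter because the rule used to adapt it is not described here.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y; } MV;

/* Median of three integers. */
static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }   /* now a <= b */
    if (b > c) b = c;                         /* b = min(b, c) */
    return a > b ? a : b;                     /* max(a, min(b, c)) */
}

/* SAD between the 16x16 block at (bx,by) in cur and the block in ref
   displaced by mv. */
static int sad16(const unsigned char *cur, const unsigned char *ref,
                 int stride, int bx, int by, MV mv)
{
    int sum = 0;
    for (int j = 0; j < 16; j++)
        for (int i = 0; i < 16; i++)
            sum += abs(cur[(by + j) * stride + bx + i] -
                       ref[(by + mv.y + j) * stride + bx + mv.x + i]);
    return sum;
}

/* True if the displaced 16x16 block lies fully inside a w x h frame. */
static int inside(int w, int h, int bx, int by, MV mv)
{
    int x = bx + mv.x, y = by + mv.y;
    return x >= 0 && y >= 0 && x + 16 <= w && y + 16 <= h;
}

/* Start at the predicted vector, stop as soon as the SAD drops below
   threshold, otherwise refine over the four nearest neighbors.
   *points receives the number of positions evaluated. */
MV predictive_search(const unsigned char *cur, const unsigned char *ref,
                     int w, int h, int bx, int by,
                     MV left, MV top, MV topright,
                     int threshold, int range, int *points)
{
    static const MV step[4] = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };
    MV zero = {0, 0};
    MV best = { median3(left.x, top.x, topright.x),
                median3(left.y, top.y, topright.y) };
    int best_sad, improved = 1;

    assert(inside(w, h, bx, by, zero));
    if (abs(best.x) > range || abs(best.y) > range ||
        !inside(w, h, bx, by, best))
        best = zero;                          /* fall back to no motion */
    best_sad = sad16(cur, ref, w, bx, by, best);
    *points = 1;

    while (best_sad >= threshold && improved) {
        MV centre = best;
        improved = 0;
        for (int k = 0; k < 4; k++) {
            MV c = { centre.x + step[k].x, centre.y + step[k].y };
            if (abs(c.x) > range || abs(c.y) > range ||
                !inside(w, h, bx, by, c))
                continue;
            int s = sad16(cur, ref, w, bx, by, c);
            (*points)++;
            if (s < best_sad) { best_sad = s; best = c; improved = 1; }
        }
    }
    return best;
}
```

When the neighbors predict the motion well, the first evaluated position already falls below the threshold and the search ends after a single point, which is where the large saving over full search comes from.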
* Command-level optimization
If the methods above still fail to meet the real-time requirement, command-level (instruction-level) optimization is required. The main methods are as follows.
· Loop unrolling: unroll the for loops in the C code and schedule the pipeline to improve parallelism.
· Call the rich set of intrinsic functions provided by the system.
· Adjust the data structures: place data that will be accessed or processed in bulk together in memory, so that it can be moved by the DMA mechanism or handled by parallel instructions, as in the interpolation function module.
· Extract the time-consuming functions and rewrite them in linear assembly, making full use of the rich media-processing instructions [5] to maximize use of the DSP's parallelism.
For example, the SAD computation that is called frequently in motion estimation subtracts corresponding pixels and sums the absolute values of the residuals. The original algorithm computes the difference for each pixel pair separately and then accumulates the absolute values. We rewrote it in linear assembly using SUBABS4 (absolute differences of four pairs of packed bytes at once), DOTPU4 (inner product of two groups of four bytes at once), and LDW/LDNW (loading 4 bytes at once), which greatly improves parallelism. For a 16 × 16 block, more than 1000 instructions are required before optimization; after optimization, 200 are enough. We exploit this parallelism fully by rewriting the time-consuming functions in assembly, including the DCT, the inverse DCT, integer-pixel motion estimation, sub-pixel search, intra-frame coding, and interpolation.
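The SAD rewrite can be illustrated in portable C. The sketch below is not the linear assembly used in this work: `subabs4_emul`, `dotpu4_emul`, and `load4` are hypothetical helpers that only imitate, in plain C, what SUBABS4, DOTPU4, and LDNW do on packed bytes, so that the four-pixels-at-a-time structure can be compared against the scalar loop on any machine. On the DSP each helper would correspond to a single instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Baseline: one pixel pair per iteration. */
unsigned sad16x16_scalar(const uint8_t *a, const uint8_t *b, int stride)
{
    unsigned sum = 0;
    for (int j = 0; j < 16; j++)
        for (int i = 0; i < 16; i++)
            sum += (unsigned)abs(a[j * stride + i] - b[j * stride + i]);
    return sum;
}

/* Stand-in for SUBABS4: |x - y| on each of four packed bytes. */
static uint32_t subabs4_emul(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int k = 0; k < 32; k += 8) {
        int d = (int)((x >> k) & 0xFF) - (int)((y >> k) & 0xFF);
        r |= (uint32_t)abs(d) << k;
    }
    return r;
}

/* Stand-in for DOTPU4: sum of the four unsigned byte products. */
static uint32_t dotpu4_emul(uint32_t x, uint32_t y)
{
    uint32_t s = 0;
    for (int k = 0; k < 32; k += 8)
        s += ((x >> k) & 0xFF) * ((y >> k) & 0xFF);
    return s;
}

/* Unaligned 4-byte load, the role LDNW plays on the DSP. */
static uint32_t load4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Packed version: four pixel pairs per operation, inner loop unrolled. */
unsigned sad16x16_packed(const uint8_t *a, const uint8_t *b, int stride)
{
    const uint32_t ones = 0x01010101u; /* dot product with 1,1,1,1 sums bytes */
    unsigned sum = 0;
    assert(stride >= 16);
    for (int j = 0; j < 16; j++) {
        const uint8_t *pa = a + j * stride, *pb = b + j * stride;
        sum += dotpu4_emul(subabs4_emul(load4(pa),      load4(pb)),      ones);
        sum += dotpu4_emul(subabs4_emul(load4(pa + 4),  load4(pb + 4)),  ones);
        sum += dotpu4_emul(subabs4_emul(load4(pa + 8),  load4(pb + 8)),  ones);
        sum += dotpu4_emul(subabs4_emul(load4(pa + 12), load4(pb + 12)), ones);
    }
    return sum;
}
```

Taking the dot product of the packed absolute differences with a word of ones is what turns DOTPU4 into a four-way horizontal add, so each group of four pixels costs two loads, one SUBABS4, and one DOTPU4 instead of four separate subtract, absolute-value, and accumulate steps.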
Algorithm performance evaluation and prospects
The encoder and decoder algorithms were tested in the NVDK C6416 environment. For QCIF test sequences, the encoder reaches 40-50 frames/second and the decoder reaches 50-60 frames/second, which meets the goal of real-time encoding and decoding. Because the code is compatible and portable, the codec implemented on the C6416 can be transplanted to TI's dedicated media-processing chips, whose rich media-processing interfaces and coprocessors should give better performance.