Implementation of H.264 Video Encoder Based on ADSP-BF561


Source: China Power Grid  Authors: Cui Haiyan, Wang Qing

0 Introduction

H.264/AVC is the latest international video coding standard, developed jointly by ITU-T VCEG and ISO/IEC MPEG. Its video coding layer (VCL) adopts many new techniques that greatly improve coding performance, but this comes at the cost of much higher complexity, which makes real-time H.264 encoding and transmission a major challenge. To meet the real-time requirements of image compression, existing H.264 codecs must therefore be optimized. This article discusses the hardware platform and task flow of an H.264 encoding system and, in view of the features of the DSP hardware platform, introduces code-level optimization of the algorithm to further increase the speed of the encoding algorithm and achieve real-time H.264 encoding. The ADSP-BF561 (Blackfin561) is a high-performance digital signal processor from Analog Devices with a clock speed of up to 750 MHz, so this paper chooses it as the hardware platform to explore an effective way to implement an H.264 encoder on a DSP with limited resources.

1 Hardware Platform

1.1 ADSP-BF561 Processor

Blackfin561 is a high-performance fixed-point DSP for video processing in the Blackfin family, with a clock speed of up to 750 MHz. Each core contains two 16-bit multiply-accumulate (MAC) units, two 40-bit ALUs, four 8-bit video ALUs, and a 40-bit barrel shifter. Two data address generators (DAGs) supply addresses so that two operands can be fetched from memory simultaneously, allowing the two MAC units to complete two multiply-accumulate operations per cycle. The chip provides dedicated video signal processing instructions and 100 KB of L1 memory (16 KB instruction cache, 16 KB instruction SRAM, 64 KB data cache/SRAM, and 4 KB scratchpad data SRAM), 128 KB of on-chip L2 SRAM, and dynamic power management. In addition, the Blackfin processor includes a rich set of peripheral interfaces: an external bus interface unit (EBIU, supporting four 128 MB SDRAM banks and four 1 MB asynchronous memory banks), three timers/counters, one UART, one SPI interface, two synchronous serial ports, two parallel peripheral interfaces (PPI, supporting the ITU-R BT.656 data format), and so on. The structure of the Blackfin processor fully reflects its support for media applications, especially video algorithms.

1.2 Video Encoder Platform Based on ADSP-BF561

The hardware structure of the Blackfin561 video encoder is shown in Figure 1. The hardware platform uses ADI's ADSP-BF561 EZ-KIT Lite evaluation board, which includes one ADSP-BF561 processor, 32 MB of SDRAM, and 4 MB of Flash. The AD1836 audio codec on the board provides a 4-input/6-output audio interface, and the ADV7183 video decoder and ADV7171 video encoder provide a 3-input/3-output video interface. The evaluation board also includes one UART interface, one USB debugging interface, and one JTAG debugging interface. In Figure 1, the analog video signal from the camera is converted into a digital signal by the ADV7183A video decoder and fed into the Blackfin561 through PPI1 (a parallel peripheral interface) for compression; the output from the PPI2 port of the ADSP-BF561 is converted by the ADV7179. The system can load programs from Flash and supports serial-port and network transmission. Raw images, reference frames, and other data used during encoding are stored in SDRAM.

2 Main Features of the H.264 Video Compression Encoding Algorithm

Video encoding/decoding standards mainly fall into two series: the MPEG series and the H.26x series. The MPEG series standards are developed by ISO/IEC (the International Organization for Standardization and the International Electrotechnical Commission), while the H.26x series standards are developed by ITU-T (the International Telecommunication Union Telecommunication Standardization Sector). The ITU-T standards include H.261, H.262, H.263, and H.264, and are mainly used for real-time video communication such as video conferencing.

The H.264 video compression algorithm adopts a block-based hybrid coding method similar to H.263 and MPEG-4, with two coding modes: intra-frame coding (Intra) and inter-frame coding (Inter). To improve coding efficiency, compression ratio, and image quality, H.264 adopts the following new coding technologies:

(1) The H.264 video coding system is divided into two layers by function: the Video Coding Layer (VCL) and the Network Abstraction Layer (NAL). The VCL compresses the video sequence efficiently, while the NAL standardizes the format of the video data, mainly providing header information for the transmission and storage of various media.

(2) Advanced intra-frame prediction: 4 × 4 prediction is used for macroblocks containing more spatial detail, while a 16 × 16 prediction mode is used for relatively flat areas. The former has nine prediction modes and the latter has four.

(3) More block partition types are used for inter-frame prediction. The standard defines seven partitions of different sizes and shapes: macroblock partitions (16 × 16, 16 × 8, 8 × 16) and sub-macroblock partitions (8 × 8, 8 × 4, 4 × 8, 4 × 4). Because smaller blocks and adaptive coding are used, the amount of prediction-residual data can be reduced, further lowering the bit rate.

(4) High-precision motion prediction with quarter-pixel (1/4-pixel) accuracy.

(5) Multi-reference-frame prediction: during inter-frame coding, up to five different reference frames can be selected.

(6) Integer transform (DCT/IDCT). A 4 × 4 integer transform of the residual image replaces the floating-point DCT of earlier standards with fixed-point arithmetic, which reduces coding time and makes the algorithm easier to port to hardware platforms (a sketch of this transform is given at the end of this section).

(7) H.264/AVC supports two entropy coding methods: CAVLC (context-adaptive variable-length coding) and CABAC (context-adaptive binary arithmetic coding). CAVLC has better error resilience but lower coding efficiency than CABAC; CABAC has higher coding efficiency but requires more computation and storage.

(8) A new in-loop (deblocking) filtering technique.

These new technologies make H.264 a major step forward in moving-image compression: its compression performance is better than that of MPEG-4 and H.263, and it can be used in demanding video compression applications such as the Internet, digital video, DVD, and television broadcasting.
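To make feature (6) concrete, below is a minimal sketch of the core 4 × 4 forward integer transform written in plain C; the function name and array layout are illustrative choices, and the scaling that H.264 folds into the quantization stage is omitted.

#include <stdint.h>

/* Minimal sketch of the H.264 4x4 forward integer transform core.
 * It computes Y = C * X * C^T with the integer matrix
 *   C = [ 1  1  1  1
 *         2  1 -1 -2
 *         1 -1 -1  1
 *         1 -2  2 -1 ]
 * using only additions, subtractions and shifts (no floating point).
 * The scaling normally folded into quantization is omitted here. */
static void forward_transform_4x4(const int16_t in[4][4], int32_t out[4][4])
{
    int32_t tmp[4][4];

    /* Transform the rows: tmp = X * C^T */
    for (int i = 0; i < 4; i++) {
        int32_t s03 = in[i][0] + in[i][3];
        int32_t d03 = in[i][0] - in[i][3];
        int32_t s12 = in[i][1] + in[i][2];
        int32_t d12 = in[i][1] - in[i][2];

        tmp[i][0] = s03 + s12;
        tmp[i][1] = 2 * d03 + d12;
        tmp[i][2] = s03 - s12;
        tmp[i][3] = d03 - 2 * d12;
    }

    /* Transform the columns: out = C * tmp */
    for (int j = 0; j < 4; j++) {
        int32_t s03 = tmp[0][j] + tmp[3][j];
        int32_t d03 = tmp[0][j] - tmp[3][j];
        int32_t s12 = tmp[1][j] + tmp[2][j];
        int32_t d12 = tmp[1][j] - tmp[2][j];

        out[0][j] = s03 + s12;
        out[1][j] = 2 * d03 + d12;
        out[2][j] = s03 - s12;
        out[3][j] = d03 - 2 * d12;
    }
}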

3 H.264 Video Encoding Algorithm Implementation

Implementing H.264 on the DSP involves three steps: optimizing the C algorithm on the PC, porting the program from the PC to the DSP, and optimizing the code on the DSP platform.

3.1 C Algorithm Optimization on the PC

According to the system requirements, this design selects the baseline profile of the ITU JM 8.5 reference software as the starting algorithm. The JM reference software is designed for the PC, so it achieves good coding quality, but when porting video codec software to a DSP the system resources, mainly memory space (both program space and data space), must be taken into account. The original C code therefore has to be evaluated so that the code to be ported is well understood. Figure 2 shows the algorithm structure of H.264.

After understanding the algorithm structure, the most computation-intensive and time-consuming parts of the encoder must be identified. According to the profiling tool provided by VC6, inter-frame coding accounts for more than 60% of the total running time, and within it motion estimation (ME) takes a large share. The focus of porting and optimization should therefore be on motion estimation, and the code structure should be adjusted accordingly.

(1) Delete unnecessary files and functions

Because the baseline profile and a single reference frame are used, many files and functions can be deleted. This includes the program code for features that are not needed, such as B frames, SI/SP frames, data partitioning, hierarchical coding, weighted prediction, and the CABAC coding mode, as well as the RTP output path. The files rtp.c, sei.c, leaky_bucket.c, and intrarefresh.c, their header files, and the corresponding global variables and functions declared in global.h can be removed. Global variables such as top_pic and bottom_pic, which are related to field coding, frame/field adaptive coding, and macroblock adaptive coding, can also be deleted, together with the code for multi-slice segmentation, FMO, and the related prediction, reference-frame reordering, input/output, and decoder-buffer operations. Redundant code for the random intra-macroblock refresh mode and the weighted prediction mode can likewise be removed. For example, the encoder can be made to output a plain NAL byte stream instead of the RTP format, so rtp.c is no longer needed; sei.c handles supplemental enhancement information that is not written into the code stream and can be removed if unused; and leaky_bucket.c, which computes leaky-bucket parameters, can also be deleted.

(2) Rewrite the configuration function

JM configures its system parameters by reading the encoder.cfg file. The parameter configuration can therefore be changed from file reading to direct assignments in an initialization function, which reduces the amount of code, the memory usage, and the read time, and improves the overall encoding speed of the encoder. For example, the input->img_height variable (declared as int) can simply be assigned directly: input->img_height = 288 (CIF format).
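A minimal sketch of such an initialization function is shown below. Apart from img_height, which appears in the article, the structure layout and field names are illustrative and do not cover everything that encoder.cfg normally sets.

/* Sketch: replace reading encoder.cfg with direct assignments.
 * Field names other than img_height are illustrative examples. */
typedef struct {
    int img_width;
    int img_height;
    int frames_to_encode;
    int qp0;                 /* QP of the first frame        */
    int num_reference_frames;
} InputParameters;

static InputParameters cfg;
InputParameters *input = &cfg;

void configure_encoder(void)
{
    input->img_width            = 352;   /* CIF */
    input->img_height           = 288;
    input->frames_to_encode     = 300;
    input->qp0                  = 28;
    input->num_reference_frames = 1;     /* single reference frame */
}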

(3) Remove redundant printing information

To facilitate debugging and algorithm improvement, JM retains a large amount of printed output. To increase the encoding speed and reduce storage consumption, this information can be removed, for example the large amount of trace output and the coding statistics files such as log.dat and stat.dat, which are only needed for debugging on the PC and do not have to be ported to the DSP platform. The basic information needed for debugging (such as bit rate, signal-to-noise ratio, and coding sequence) should, however, be retained for reference.

These adjustments reduce the code size and simplify its structure, preparing it for the subsequent port to the DSP.

3.2 Program Porting from the PC to the DSP

To port the streamlined PC program into VisualDSP++, the development environment of the ADSP-BF561, and get it running initially, the main issues to consider are syntax rules and memory allocation.

(1) Remove all functions not supported by the compilation environment

This mainly means removing time-related functions, changing file operations into reads from a data buffer, deleting SNR statistics collection, and removing other code that is unnecessary on the DSP platform. Note that function declarations and data structure types must comply with the C dialect of the DSP compiler.
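As an illustration of replacing file operations with buffer reads, the sketch below substitutes a copy from a memory-resident capture buffer for the fread() call used on the PC; the buffer name and the assumption that the raw frame has already been placed in memory (for example by the PPI/DMA path) are choices made for this example only.

#include <string.h>
#include <stdint.h>

#define WIDTH  352
#define HEIGHT 288
#define FRAME_BYTES (WIDTH * HEIGHT * 3 / 2)   /* YUV 4:2:0 */

/* Assumed: the capture path has already placed the current raw frame
 * here; in the PC version this data came from fread() on a YUV file. */
extern uint8_t capture_buffer[FRAME_BYTES];

/* Replaces: fread(dst, 1, FRAME_BYTES, p_in); */
static void read_one_frame(uint8_t *dst)
{
    memcpy(dst, capture_buffer, FRAME_BYTES);
}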

(2) Add hardware-related code

This code includes system initialization, the output module, the interrupt service routines, and the rate-control routine.

(3) Configure the LDF file

The ported code is usually too large, in both program and data, to fit in on-chip SRAM, and linking it there will fail. At the beginning it is best to place all program and data sections in SDRAM so that the link succeeds; the stack and the heap are handled in the same way and are also placed in SDRAM first. At this stage the goal is to get the program running correctly; speed comes second.

(4) Solve the malloc problem

malloc is a problem that must be dealt with during DSP development: even if dynamically allocated memory appears to work, the results are often incorrect. It is therefore best to allocate memory statically, for example in the form of arrays.
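A minimal sketch of this replacement is shown below; the buffer name and size are illustrative and not taken from the JM code.

#include <stdint.h>

#define WIDTH  352
#define HEIGHT 288

/* Dynamic allocation, as used in the PC version:
 *     uint8_t *ref_frame = malloc(WIDTH * HEIGHT);
 * Static allocation for the DSP: the buffer is created at link time
 * and can later be placed in a specific memory section by the LDF. */
static uint8_t ref_frame[WIDTH * HEIGHT];

void init_buffers(void)
{
    for (int i = 0; i < WIDTH * HEIGHT; i++)
        ref_frame[i] = 128;   /* mid-gray initial reference */
}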

After porting, H.264 encoding runs on the ADSP-BF561 processor. If the speed does not yet meet the requirements of real-time encoding, the code can be optimized further.

4 Code Optimization on the DSP Platform

In the VisualDSP++ development environment, the main methods of code optimization are C language-level optimization and assembly-level optimization.

4.1 C Language-Level Optimization

As noted above, profiling with VC6 showed that the focus of porting and optimization should be on motion estimation. After comparing various algorithms, the author selects the diamond search (DS) method. The DS algorithm uses two search templates: the large diamond search pattern (LDSP) with nine search points and the small diamond search pattern (SDSP) with five search points. The diamond search is illustrated in Figure 3. The search first uses the large template: if the point with the smallest block error (minimum SAD) is not the center of the LDSP, the LDSP is re-centered on that point and the search repeats. Once the minimum-SAD point is the center, the algorithm switches to the small template SDSP, and the minimum-SAD point among its five points is taken as the optimal match, completing the search.


Experiments with JM show that this method saves about 10% of the running time without increasing the amount of code by much.
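The following is a minimal sketch of this diamond search, assuming a simple sum-of-absolute-differences (SAD) cost over 16 × 16 blocks and ignoring search-range clipping and the sub-pixel refinement that a full encoder would add; all function and variable names are illustrative.

#include <stdlib.h>
#include <limits.h>
#include <stdint.h>

#define MB 16   /* macroblock size */

/* SAD between the 16x16 block at (x,y) in cur and the block displaced
 * by (dx,dy) in ref. 'stride' is the picture width; boundary checks
 * are omitted in this sketch. */
static int sad16(const uint8_t *cur, const uint8_t *ref,
                 int stride, int x, int y, int dx, int dy)
{
    int s = 0;
    for (int j = 0; j < MB; j++)
        for (int i = 0; i < MB; i++)
            s += abs(cur[(y + j) * stride + x + i] -
                     ref[(y + j + dy) * stride + x + i + dx]);
    return s;
}

/* Diamond search: repeat the large diamond (LDSP) until the best point
 * is its center, then do one small diamond (SDSP) step to refine. */
static void diamond_search(const uint8_t *cur, const uint8_t *ref,
                           int stride, int x, int y,
                           int *best_dx, int *best_dy)
{
    static const int ldsp[9][2] = { {0,0},{0,-2},{1,-1},{2,0},{1,1},
                                    {0,2},{-1,1},{-2,0},{-1,-1} };
    static const int sdsp[5][2] = { {0,0},{0,-1},{1,0},{0,1},{-1,0} };
    int cx = 0, cy = 0;

    for (;;) {
        int best = INT_MAX, bi = 0;
        for (int k = 0; k < 9; k++) {
            int cost = sad16(cur, ref, stride, x, y,
                             cx + ldsp[k][0], cy + ldsp[k][1]);
            if (cost < best) { best = cost; bi = k; }
        }
        if (bi == 0)          /* minimum SAD is at the LDSP center */
            break;
        cx += ldsp[bi][0];    /* re-center the large diamond */
        cy += ldsp[bi][1];
    }

    /* Final SDSP step: the minimum among the five points is the result. */
    int best = INT_MAX, bi = 0;
    for (int k = 0; k < 5; k++) {
        int cost = sad16(cur, ref, stride, x, y,
                         cx + sdsp[k][0], cy + sdsp[k][1]);
        if (cost < best) { best = cost; bi = k; }
    }
    *best_dx = cx + sdsp[bi][0];
    *best_dy = cy + sdsp[bi][1];
}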

Based on the DSP's features and its instruction set, the code can be optimized in the following ways during design:

◇ Adjust the program structure: rewrite statements that are not suited to DSP execution to improve code concurrency.
◇ Use macros: change functions that have short execution times but are called very often into macros.
◇ Optimize loops: unroll the for loops in the C code, arrange the software pipeline, and improve concurrency.
◇ Use lookup tables: convert values that would otherwise be computed at run time into constant tables, so that run-time computation is replaced by precomputation plus a simple lookup. For example, for the shift counts used in the quantization and dequantization routines, all possible values can be computed in advance and later obtained by table lookup (see the sketch after this list).
◇ Convert floating point to fixed point: Blackfin561 does not support floating-point operations in hardware, so the floating-point arithmetic in the original code must be changed to fixed-point arithmetic, which executes much faster.
◇ Use logical (shift) operations instead of multiplication and division: multiply and especially divide instructions take much longer than logical shift instructions, so shifts should be used wherever possible to speed up the code.
◇ Use as few function calls as possible: small functions are best made inline or written directly into the calling function, and functions that are not called often can also be merged into their callers, eliminating unnecessary call overhead.
◇ Reduce conditional branches.
◇ Use static memory allocation as much as possible.
◇ Call the rich set of intrinsic (built-in) functions provided by the system.
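The sketch below illustrates the table-lookup and shift-instead-of-divide ideas from the list above. The table holds the standard H.264 quantization multipliers for the (0,0) coefficient position, but the function name and the simple rounding offset are illustrative; this is not the actual JM quantization code.

#include <stdlib.h>
#include <stdint.h>

/* Precomputed table indexed by QP % 6: the H.264 quantization
 * multipliers for the (0,0) coefficient position. */
static const uint16_t quant_coef[6] = {
    13107, 11916, 10082, 9362, 8192, 7282
};

/* Quantization-style operation using a table lookup and shifts
 * instead of division:  level = (|coef| * M + offset) >> qbits.
 * JM uses a different (dead-zone) offset; a plain rounding offset
 * is shown here for simplicity. */
static int quantize(int coef, int qp)
{
    int qbits  = 15 + qp / 6;
    int offset = 1 << (qbits - 1);          /* rounding offset         */
    int m      = quant_coef[qp % 6];        /* table lookup, no divide */
    int level  = (abs(coef) * m + offset) >> qbits;  /* shift, no divide */
    return (coef < 0) ? -level : level;
}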

In addition, to make full use of the DSP's computational capability, one should also start from its hardware structure, make full use of its eight functional units, and use software pipelining so that instructions execute in parallel without conflicts. The most time-consuming functions can also be extracted and rewritten in assembly to exploit the DSP's parallelism to the fullest.

4.2 Assembly-Level Optimization

Assembly-level optimization mainly involves the following operations:

(1) Use register resources

Blackfin561 provides eight 32-bit data registers and a set of address registers. If the intermediate results that would otherwise be kept in local variables (that is, in memory) are held in registers instead, a large amount of memory-access time can be saved.

(2) Use special instructions

Blackfin561 provides instructions for maximum, minimum, and absolute value, as well as a large number of video-oriented instructions, and a single instruction can access several small data items at once. Using these instructions can greatly improve execution speed. For example, a single 32-bit (int) access can fetch two 16-bit (short) values and place them in the high and low 16 bits of a 32-bit register; this doubles the data-reading efficiency and reduces the number of memory accesses.
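A minimal C-level sketch of this idea is shown below: two 16-bit samples are fetched with one 32-bit load. On the Blackfin this maps onto the dual-16-bit halves of the data registers and the video instructions; the plain-C version here is only an illustration and does not use compiler intrinsics or Blackfin-specific syntax.

#include <stdint.h>

/* Accumulate pairs of 16-bit samples using 32-bit loads. One 32-bit
 * access brings in two shorts (the two halves of the word), halving
 * the number of memory accesses compared with loading each short
 * separately. 'n' must be even and 'src' must be 32-bit aligned. */
static int32_t sum_pairs(const int16_t *src, int n)
{
    const uint32_t *p = (const uint32_t *)src;
    int32_t sum = 0;

    for (int i = 0; i < n / 2; i++) {
        uint32_t w  = p[i];                      /* one 32-bit load */
        int16_t  lo = (int16_t)(w & 0xFFFF);     /* low 16 bits     */
        int16_t  hi = (int16_t)(w >> 16);        /* high 16 bits    */
        sum += lo + hi;
    }
    return sum;
}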

(3) Use parallel and vector instructions

Each arithmetic instruction in the ADSP-BF561 can be issued in parallel with one or two memory-access instructions, which keeps the ADSP-BF561 pipeline fully loaded and exploits its data-processing capability to the fullest.

(4) Store frequently called program segments sensibly

Program segments that are called repeatedly (such as the DCT and IDCT transforms) should be placed in on-chip program memory, and frequently used data segments (such as coding tables) in on-chip data memory, while rarely used program and data segments can be kept in off-chip memory. This avoids unnecessary, repeated movement of code or data.
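One way to express such placement in VisualDSP++ C is the section() qualifier, as sketched below. The section names must match sections defined in the project's LDF file; the names used here ("L1_data_a", "sdram0") and the buffer sizes are illustrative assumptions.

/* Sketch: placing hot data on-chip and cold data off-chip with the
 * VisualDSP++ section() qualifier. The section names must exist in
 * the LDF; the ones shown are examples, not fixed names. */

/* Frequently used coding table: keep in on-chip L1 data memory. */
section("L1_data_a") static const unsigned char vlc_table[512] = { 0 };

/* Full reference frame: far too large for on-chip memory, keep in SDRAM. */
section("sdram0") static unsigned char ref_frame[352 * 288 * 3 / 2];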

(5) Rational use of internal and external memory

The BF561 has only a limited amount of on-chip storage (on the order of a few hundred KB), so the current frame, the reference frame, and the reconstructed frame must be stored in off-chip memory. The compressed bit stream, if it is read out by a host, can also be stored off-chip. Other data, such as program code, global variables, the VLC code table, and the intermediate data produced by each encoding module, can be kept on-chip.

(6) DMA usage

Because the CPU accesses off-chip memory tens of times more slowly than on-chip memory, transferring off-chip data is usually the bottleneck at run time; even efficient code will stall the pipeline while waiting for data. An effective solution is to move the data with DMA. The encoder processes one macroblock at a time: while the current macroblock is being encoded, DMA transfers the next macroblock and the reference-frame data it needs from off-chip to on-chip memory, and after motion compensation DMA transfers the reconstructed macroblock back off-chip. In this way the CPU operates only on on-chip data and the pipeline runs smoothly. The compressed bit stream is written out codeword by codeword and can be written off-chip directly by the CPU.
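The double-buffering scheme described above can be sketched as follows. The DMA helpers (dma_copy_in, dma_copy_out, dma_wait) and the address helpers are placeholders standing in for the Blackfin memory-DMA driver and the encoder's own bookkeeping, and the buffer sizes are illustrative; the point of the sketch is the overlap of the transfer for macroblock n+1 with the encoding of macroblock n.

#include <stdint.h>

#define MB_BYTES      (16 * 16 * 3 / 2)   /* one YUV 4:2:0 macroblock     */
#define SEARCH_BYTES  (48 * 48)           /* luma search window (example) */

/* Two on-chip buffers so DMA can fill one while the CPU encodes from
 * the other (ping-pong / double buffering). */
static uint8_t mb_buf[2][MB_BYTES];
static uint8_t ref_buf[2][SEARCH_BYTES];
static uint8_t rec_buf[2][MB_BYTES];

/* Placeholder DMA helpers: in a real system these would program the
 * Blackfin memory-DMA channels and wait for their completion. */
void dma_copy_in(void *dst, const void *src, unsigned bytes);
void dma_copy_out(void *dst, const void *src, unsigned bytes);
void dma_wait(void);

/* Assumed helpers giving the off-chip (SDRAM) addresses for macroblock n. */
const uint8_t *offchip_mb_addr(int n);
const uint8_t *offchip_ref_addr(int n);
uint8_t       *offchip_rec_addr(int n);

/* Assumed encoder core working purely on on-chip data. */
void encode_macroblock(const uint8_t *mb, const uint8_t *ref, uint8_t *rec);

void encode_frame(int num_macroblocks)
{
    int cur = 0;

    /* Prime the pipeline: bring in macroblock 0 and its reference window. */
    dma_copy_in(mb_buf[cur], offchip_mb_addr(0), MB_BYTES);
    dma_copy_in(ref_buf[cur], offchip_ref_addr(0), SEARCH_BYTES);
    dma_wait();

    for (int n = 0; n < num_macroblocks; n++) {
        int nxt = cur ^ 1;

        /* Start the DMA for the NEXT macroblock while this one is encoded. */
        if (n + 1 < num_macroblocks) {
            dma_copy_in(mb_buf[nxt], offchip_mb_addr(n + 1), MB_BYTES);
            dma_copy_in(ref_buf[nxt], offchip_ref_addr(n + 1), SEARCH_BYTES);
        }

        /* Encode the current macroblock entirely from on-chip buffers. */
        encode_macroblock(mb_buf[cur], ref_buf[cur], rec_buf[cur]);

        /* Send the reconstructed macroblock back to off-chip memory. */
        dma_copy_out(offchip_rec_addr(n), rec_buf[cur], MB_BYTES);

        dma_wait();   /* ensure the prefetch and write-back have completed */
        cur = nxt;
    }
}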

5 Conclusion

After the corresponding functions were rewritten in ADSP-BF561 assembly language and the optimized program was debugged and run, the efficiency of the DCT/IDCT improved by about 15 times and that of the deblocking filter by about 6 to 7 times. Other functions in these modules also achieved good optimization results, showing that the optimization was effective.

 
