1 Introduction
In recent years, with the development of network and multimedia technologies, the importance of and demand for video communication have increased dramatically, and video compression coding is the key enabling technology. Reference [1] proposes a DSP-based video encoding scheme that implements the H.264 algorithm. Compared with H.264, MPEG4 has lower hardware and software development costs and is easier to implement, and it is currently a mainstream choice for video encoding applications. In this paper, we propose a DSP-based implementation of an MPEG4 video encoder. The method can be used in remote video monitoring, video conferencing, and many other fields.
MPEG4 is an international video compression coding standard developed by the Moving Picture Experts Group (MPEG). It has developed into an efficient set of compression algorithms and tools that adapts to different transmission bandwidths and aims to obtain the best image quality with the least amount of data. MPEG4 uses DCT, quantization, entropy coding, and other algorithms to analyze shape, motion, texture, and other information and thereby remove the temporal and spatial correlation in image data. Its advantages, such as efficient compression and broad applicability, make video information convenient to store and transmit.
MPEG4 defines different profiles and levels for the encoder and bitstream to suit the bit rate, resolution, quality, and service requirements of different applications. The Simple Profile provides encoding for rectangular video objects. This article implements the Simple Profile of the MPEG4 video encoding algorithm.
2 MPEG4 encoder hardware platform
The hardware platform that implements the MPEG4 encoder is based on the TMS320DM642 DSP and works with peripheral devices such as external SDRAM and flash memory.
2.1 DM642 features
The TMS320DM642 is a high-performance fixed-point digital signal processor developed by TI for multimedia applications and based on the C64x core. It runs at a clock frequency of 600 MHz, with a peak processing capability of 4800 MIPS. In addition to the common fixed-point instruction set of the C6000 series DSPs, the DM642 adds multimedia extension instructions that speed up image processing algorithms. These features make it well suited to video image processing and an ideal hardware platform for an MPEG4 video encoder.
2.2 Hardware System Structure
Figure 1 shows the hardware platform of the encoder. As the core of the entire system, the DM642 processes video data at high speed and executes the MPEG4 encoding algorithm. The programmable video format conversion circuit pre-processes the raw input video and converts it into a digital signal in a video format that the encoder accepts. The E2PROM and flash store the application program and initialization parameters, and the SDRAM, as off-chip memory, stores the video data to be processed during encoding. These three devices are connected to the DM642 through the EMIF bus. CCS accesses the DSP through the JTAG interface, and the real-time clock provides a time reference for the digital video.
3 MPEG4 encoder software implementation and Optimization
3.1 MPEG4 software implementation
MPEG4 is an open framework standard that does not prescribe specific algorithms or programs, so code can be developed as needed. We use the XVID 1.1.0 open source code to implement the MPEG4 encoder. The XVID code implements the MPEG4 Simple Profile algorithm. It encodes only I-VOPs and P-VOPs and does not require shape encoding. However, XVID was designed and developed for PC applications. To port it to a DSP, the code must be analyzed and modified according to the instruction structure and features of the DSP.
The MPEG4 encoder implemented by the XVID code treats each frame of the original video data as a video object. It first determines whether the frame is an I frame or a P frame. For an I frame, the image data of the whole frame is encoded and stored. For a P frame, motion estimation and compensation are performed, and only the image residual and the motion vectors between the current frame and the reference frame are encoded. Each frame is divided into 16x16 macroblocks, and each macroblock is divided into 8x8 sub-blocks. The macroblocks and sub-blocks are encoded by DCT, quantization, and VLC. Because the image quality requirements of the target application are modest, we removed some XVID functions, such as GMC (Global Motion Compensation) and RVLC, which reduced the complexity of the code.
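The partitioning described above can be illustrated with a minimal C sketch. It is not taken from the XVID source: the function names are hypothetical, and the sum of absolute differences (SAD) is used here only as an assumed block-matching cost for motion estimation, computed at a single candidate position.

```c
#include <stdint.h>
#include <stdlib.h>

#define MB_SIZE  16  /* macroblock edge length, from the text */
#define BLK_SIZE 8   /* sub-block edge length, from the text */

/* Number of 16x16 macroblocks in a frame whose dimensions are
   multiples of 16 (true for QCIF, 176x144). */
static int macroblock_count(int width, int height)
{
    return (width / MB_SIZE) * (height / MB_SIZE);
}

/* Sum of absolute differences between one 8x8 block of the current
   frame and one 8x8 block of the reference frame.
   stride is the number of bytes per image row. */
static unsigned sad_8x8(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < BLK_SIZE; y++) {
        for (int x = 0; x < BLK_SIZE; x++) {
            sad += (unsigned)abs((int)cur[y * stride + x] -
                                 (int)ref[y * stride + x]);
        }
    }
    return sad;
}

/* SAD of a whole macroblock, accumulated over its four 8x8 sub-blocks. */
static unsigned sad_macroblock(const uint8_t *cur, const uint8_t *ref,
                               int stride)
{
    unsigned sad = 0;
    for (int by = 0; by < MB_SIZE; by += BLK_SIZE) {
        for (int bx = 0; bx < MB_SIZE; bx += BLK_SIZE) {
            sad += sad_8x8(cur + by * stride + bx,
                           ref + by * stride + bx, stride);
        }
    }
    return sad;
}
```

For a QCIF frame, macroblock_count(176, 144) gives 11 x 9 = 99 macroblocks. A motion search would evaluate sad_macroblock at several candidate positions in the reference frame and keep the position with the smallest value.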
3.2 Code optimization
To improve execution efficiency, the code must be optimized for the features of the DSP. The optimization is divided into three levels:
3.2.1 Project-level optimization
TI provides a powerful integrated development environment (CCS) that includes a range of efficient compilation tools. With the options provided by the compiler (such as -o3 and -pm), the compiler can automatically improve the code structure, reduce dependencies between instructions, and apply techniques such as software pipelining to increase instruction-level parallelism, improve loop performance, and optimize code size.
3.2.2 C Language Program-level optimization
We evaluate the C code with the profile tool in CCS to find the program segments with the largest computational workload, such as DCT, quantization, and motion estimation. Optimizing these parts of the code has a significant impact on encoder performance. We adopt the following C-level optimization methods:
(1) Rewrite the C code using keywords and intrinsics specific to the C6000 DSP. For example, the restrict keyword removes assumed dependencies between data accesses and improves the parallel execution capability of the code. Intrinsics such as _add2() are special functions that map directly to C6000 instructions and are inlined, and hints such as _nassert() give the optimizer additional information. Both can quickly improve the execution efficiency of C code on the DSP.
(2) Use integer accesses for short data. A 32-bit integer access reads two 16-bit short values at a time and places them in the high and low 16-bit fields of a 32-bit register. This halves the number of memory accesses and doubles the efficiency of reading data. Intrinsics that operate on the high and low 16 bits of two registers simultaneously, such as _add2() and _mpy2(), can then greatly improve execution efficiency.
(3) Use loop unrolling to turn multi-level loops into fewer loops or even a single loop, reduce loop nesting, eliminate redundant loops, and increase the degree of parallel instruction execution.
(4) The DSP does not have a dedicated hardware division unit. Division is implemented by repeated subtraction, which is computationally expensive. Division operations should therefore be minimized, and those that cannot be eliminated should be implemented with shift operations where possible to reduce computing time.
(5) Use the TI image library. TI provides a powerful image library (IMGLIB) that includes many common image processing functions, such as the 8x8 block DCT (IMG_fdct_8x8) and SAD computation (IMG_sad_8x8). These functions are already optimized and highly efficient, and can be used directly in the program.
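Methods (2) and (4) can be sketched in portable C. The code below does not use the TI compiler: add2_emul is a hypothetical stand-in that only emulates the behavior of the _add2() intrinsic (independent addition of the two 16-bit halves, with no carry between them), and pack2, add_pairs, and div_pow2 are illustrative helper names. On the DSP, the two 16-bit values would normally be loaded with a single aligned 32-bit access rather than packed by hand.

```c
#include <stdint.h>

/* Portable emulation of a packed 16-bit add: the upper and lower
   halves are added independently, with no carry between them. */
static uint32_t add2_emul(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Pack two 16-bit values into one 32-bit word (hi in bits 31..16). */
static uint32_t pack2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Element-wise sum of two arrays of 16-bit values, processing two
   elements per iteration. n must be even. */
static void add_pairs(const uint16_t *a, const uint16_t *b,
                      uint16_t *out, int n)
{
    for (int i = 0; i < n; i += 2) {
        uint32_t s = add2_emul(pack2(a[i + 1], a[i]),
                               pack2(b[i + 1], b[i]));
        out[i]     = (uint16_t)(s & 0xFFFFu);
        out[i + 1] = (uint16_t)(s >> 16);
    }
}

/* Division of an unsigned value by 2^shift, done with a right shift.
   This only replaces division when the divisor is a power of two. */
static uint32_t div_pow2(uint32_t value, unsigned shift)
{
    return value >> shift;
}
```

For example, add2_emul(0x0001FFFF, 0x00010001) yields 0x00020000: the low half wraps around to zero without disturbing the high half. Likewise, div_pow2(100, 3) returns 12, the same as the integer division 100 / 8.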
3.2.3 Assembly-level optimization
Linear assembly is a programming language specific to the C6000 series DSPs. It is similar to assembly, but the programmer does not need to specify details such as the functional units, registers, and parallelism used by each instruction, because the assembly optimizer determines them automatically from the code. We use linear assembly to rewrite the key parts of the code that have a large computational load and a high call frequency, such as the quantization, DCT, and SAD modules. This further optimizes loop iterations and improves instruction-level parallelism. Table 2 compares the number of clock cycles consumed by several function modules before and after rewriting, when encoding three frames of the foreman.qcif test sequence.
3.3 Memory configuration
On-chip memory on the DSP is limited, so the large amount of video data that the encoder processes (including the current frame, the reference frame, and other images) must be placed off-chip, and the CPU accesses off-chip memory much more slowly than on-chip memory. By using the EDMA controller, the data of the previous frame is transferred into on-chip memory in advance while the CPU continues working. This improves the efficiency of off-chip to on-chip data transfer and reduces CPU wait time.
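The prefetching idea can be sketched as a ping-pong (double-buffer) scheme. This is an illustration under stated assumptions, not the encoder's actual memory layout: fetch_chunk is a hypothetical stand-in that uses memcpy where the DSP would start an asynchronous EDMA transfer, process_chunk is a placeholder for the real encoding work, and the chunk size of one macroblock row is assumed.

```c
#include <stdint.h>
#include <string.h>

#define ROW_BYTES      176  /* QCIF luma width, from the text */
#define ROWS_PER_CHUNK 16   /* one macroblock row (assumed chunk size) */
#define CHUNK_BYTES    (ROW_BYTES * ROWS_PER_CHUNK)

/* Hypothetical stand-in for an EDMA transfer from off-chip to on-chip
   memory. On the DSP the transfer would run asynchronously; memcpy is
   synchronous and only models the data movement. */
static void fetch_chunk(uint8_t *on_chip, const uint8_t *off_chip, int chunk)
{
    memcpy(on_chip, off_chip + (size_t)chunk * CHUNK_BYTES, CHUNK_BYTES);
}

/* Placeholder for the per-chunk encoding work: sums the pixel values. */
static unsigned long process_chunk(const uint8_t *buf)
{
    unsigned long sum = 0;
    for (int i = 0; i < CHUNK_BYTES; i++) {
        sum += buf[i];
    }
    return sum;
}

/* Process a frame chunk by chunk, alternating between two buffers so
   that the next chunk is requested before the current one is processed. */
static unsigned long process_frame(const uint8_t *frame, int chunks)
{
    static uint8_t buffers[2][CHUNK_BYTES];  /* ping and pong buffers */
    unsigned long total = 0;

    if (chunks <= 0) {
        return 0;
    }
    fetch_chunk(buffers[0], frame, 0);       /* prime the first buffer */
    for (int c = 0; c < chunks; c++) {
        int cur = c & 1;
        if (c + 1 < chunks) {
            /* Request the next chunk into the other buffer. */
            fetch_chunk(buffers[cur ^ 1], frame, c + 1);
        }
        total += process_chunk(buffers[cur]);
    }
    return total;
}
```

With a real asynchronous transfer, the loop would also wait for the pending transfer to complete before switching buffers. That synchronization step is omitted here because memcpy returns only after the copy has finished.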
3.4 Experimental results
The encoder was tested on standard QCIF (176x144) test sequences: news (300 frames), suzie (150 frames), and foreman (400 frames). Hardware emulation experiments were carried out using TI's integrated development environment CCS 2.0. Table 3 shows the results with the bit rate set to 100 B/s.
Analysis of the encoding results for these sequences shows that the encoder runs at more than 25 fps, which meets the requirements of real-time encoding. When the transmission bit rate is reduced, the encoding speed can be improved further. The results also show that the compression ratio differs between test sequences, which is caused by differences in motion and background change in the sequence images. For example, the suzie sequence has a simple background and gentle motion, whereas the news sequence has a relatively low compression ratio because its background changes constantly. Comparing the original images with the decoded images shows no visible distortion and no significant reduction in image quality.
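The real-time requirement can be restated as a cycle budget. The sketch below is simple arithmetic on figures given in the text (600 MHz clock, 25 fps, 176x144 frames). The function names are hypothetical, and the budget is an upper bound that assumes every CPU cycle is available to the encoder.

```c
#include <stdint.h>

/* Clock cycles available per frame at a given clock rate and frame rate. */
static uint64_t cycles_per_frame(uint64_t clock_hz, unsigned fps)
{
    return clock_hz / fps;
}

/* Clock cycles available per 16x16 macroblock, for frame dimensions
   that are multiples of 16. */
static uint64_t cycles_per_macroblock(uint64_t clock_hz, unsigned fps,
                                      unsigned width, unsigned height)
{
    unsigned macroblocks = (width / 16) * (height / 16);
    return cycles_per_frame(clock_hz, fps) / macroblocks;
}
```

At 600 MHz and 25 fps, the budget is 24,000,000 cycles per frame. For the 99 macroblocks of a QCIF frame, that is about 242,424 cycles per macroblock.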
4 Conclusion
This article discusses an implementation scheme and optimization methods for an MPEG4 encoder on the DM642, and implements the Simple Profile of the MPEG4 encoding algorithm. The experimental results show that the proposed scheme is practical and easy to implement, that the code optimization methods adopted are effective, and that the performance tests give satisfactory results. On this basis, the implementation of more advanced MPEG4 profiles and the code optimization methods can be improved further, with more in-depth research to meet more demanding application requirements.