The following summarizes some practical experience with video decoding optimization:
1. Implementing MPEG-4 video decoding in an embedded system
There are two feasible approaches.
(1) Use FFmpeg (the core of MPlayer is FFmpeg) and decode MPEG-4 with it.
1) Optimize the IDCT and VLD implementations -> inline & assembler
2) Group the data within each macroblock (MB) according to the ARM9 cache size and cache-line size, so that several MBs can be processed per pass.
VLD ---> IDCT --> MC --......> coupling
3) Optimize memory access in the key code segments (MC) -> inline & assembler
4) Do not use FFmpeg's built-in img_convert() for YUV-to-RGB conversion -> inline & assembler
5) Optimize the decoding library for the ARM instruction set -> architecture-level optimization
Configuring FFmpeg with cpu=armv4l gives better performance.
If you have IPP, enabling it brings a huge improvement.
IPP = Intel Integrated Performance Primitives, Intel's high-performance primitives library (XScale only).
(2) Use XviD. FFmpeg contains far too many decoders; if you only need MPEG-4 decoding, why drag in such a complex library?
BTW, the best XviD version for embedded systems is 0.9.2.
Version 1.1.0 contains many extra features that are usually not needed in embedded systems and are not easy to implement there.
If you want to develop your own codec optimizations, you cannot always rely on others; you had better put in the effort to implement and optimize them yourself. For that reason, I consider xvid 0.9.2 a better starting point than 1.1.0.
In fact, on an embedded platform of this clock-speed class, it is still quite feasible to optimize the XviD algorithms to achieve real-time CIF decoding; a month of work at most.
2. Impact of the decoding flow on decoding performance
Video decoding optimization usually involves a large amount of code, and the source is often obtained from elsewhere, so just reading the code is difficult, let alone optimizing it. We recently gathered some experience while optimizing a RealVideo decoder:
1) Before reading the code, familiarize yourself with the overall flow and identify the key stages of video decoding: entropy decoding, dequantization, inverse transform, interpolation, reconstruction, loop filtering, and reference-frame insertion.
Once you have these anchor points, you can quickly partition the code.
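The stages listed above can be sketched as a decoder skeleton. All function names here are invented placeholders, not taken from any real decoder; the point is that locating these calls in unfamiliar source code is the fastest way to partition it:

```c
#include <assert.h>
#include <string.h>

/* Trace of executed stages, for inspection. */
static char trace[64];

/* Hypothetical stage stubs for a generic hybrid video decoder. */
static void entropy_decode(void)    { strcat(trace, "E"); } /* VLD */
static void dequantize(void)        { strcat(trace, "Q"); }
static void motion_comp(void)       { strcat(trace, "M"); } /* interpolation */
static void inverse_transform(void) { strcat(trace, "T"); } /* IDCT */
static void reconstruct(void)       { strcat(trace, "R"); }
static void loop_filter(void)       { strcat(trace, "F"); }

/* Decode one macroblock in the conceptual order described in the text
 * (prediction before the inverse transform, reconstruction right after). */
static void decode_mb(void)
{
    entropy_decode();
    dequantize();
    motion_comp();
    inverse_transform();
    reconstruct();
    loop_filter();
}
```

Calling `decode_mb()` once appends "EQMTRF" to `trace`, making the stage order explicit.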
2) Analyze the decoding flow to determine the minimum set of buffers required for decoding and the bit width of each buffer.
3) Trace how the buffers flow through the code against the known decoding flow and check for redundant memory copies; try to reduce the number of buffers. Experience shows that the speedup from eliminating buffers is far greater than that from optimizing local algorithms. -> coupling
4) Check whether the order of operations in the program is reasonable; an unreasonable program structure leads to extra buffers.
After studying the decoding order over the past two days, I found that interpolating first and then applying the inverse transform is much more efficient than transforming first and interpolating afterwards. The reason lies in the bit widths: interpolated values are 8 bits, while values after the inverse transform are usually 9 bits. If something must be buffered before reconstruction, buffering the 8-bit interpolated values rather than the wider transform output saves half the space, so reconstruction touches much less memory. As far as I know, most efficient decoders interpolate first, apply the inverse transform, and reconstruct immediately afterwards. This not only reduces memory usage but also avoids heavy memory-access jitter, ultimately reducing cache misses. -> modify the decoding flow, coupling
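The buffering argument can be made concrete with a minimal sketch, assuming 8-bit predictions and 9-bit signed residuals (stored as int16_t); all names are illustrative. The two-pass version writes the wide residuals to a temporary buffer and re-reads them; the fused version reconstructs immediately, so the 9-bit values never touch memory:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 64  /* one 8x8 block */

static uint8_t clip8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* Two-pass: the 9-bit residuals are staged in a 16-bit temporary buffer
 * (128 extra bytes per block) and read back during reconstruction. */
static void reconstruct_two_pass(const uint8_t *pred, const int16_t *resid,
                                 uint8_t *out)
{
    int16_t tmp[N];
    memcpy(tmp, resid, sizeof tmp);          /* extra memory traffic */
    for (int i = 0; i < N; i++)
        out[i] = clip8(pred[i] + tmp[i]);
}

/* Fused: reconstruct right after the inverse transform produces each
 * residual, so only 8-bit pixels are ever stored. */
static void reconstruct_fused(const uint8_t *pred, const int16_t *resid,
                              uint8_t *out)
{
    for (int i = 0; i < N; i++)
        out[i] = clip8(pred[i] + resid[i]);
}
```

Both versions produce identical pixels; the fused one simply moves less data.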
3. Impact of the cache mechanism on decoding
First read http://www.hongen.com/pc/diy/know/mantan/cache0.htm
Write-through and write-back caching behave differently. If different memory regions can use different write policies (assuming your system supports this), the result is far more efficient than one policy for everything. Concretely: set memory that is accessed repeatedly to write-back, and set memory that is written once and then unused for a long time to write-through. This can greatly improve cache efficiency.
The first point is easy to understand; the second needs some thought. With write-through, when the addressed data is present in the cache, both the cache and main memory are updated at the same time; when it is not in the cache, the data is written straight to main memory, bypassing the cache. If the data will not be used again for a long time, it will certainly not be in the cache by then (it will have been evicted), so writing it straight to main memory is the better choice.
Conversely, with write-back, when the addressed data is in the cache, the line is updated and its dirty bit is set; the data is flushed to main memory only when it is used or evicted much later, so in the meantime it occupies a cache slot for nothing. When the data is not in the cache, it is even worse: the corresponding main-memory data (a whole cache line) must first be loaded into the cache, then updated and marked dirty, and then it waits to be flushed back. This not only occupies cache space but also pays the large cost of the line fill. The line fill is required because the cache writes data back to main memory in cache-line units, and the updated data may not fill a whole line; to keep the data consistent, the line must be loaded and merged before being flushed back.
In much video decoding, writing a decoded frame is a one-shot action; the frame is used again only as a reference for later frames. The frame buffer memory can therefore be set to write-through. The next time it is used, it serves as a reference frame, which needs only a single read pass rather than repeated accesses, so efficiency does not suffer from bypassing the cache on the write path. Experiments show that this technique alone can improve MPEG-4 SP decoding performance by 20-30%.
A related cache operation is prefetching. A prefetch instruction loads data from main memory into the cache while the CPU continues executing subsequent instructions without waiting; however, if the next instruction also needs the bus, it must wait for the prefetch to complete. When using prefetch, therefore, try to insert after the prefetch instruction roughly as many instructions as the clock cycles of a cache miss. The prefetch then runs in parallel with those instructions, the wait disappears, and the cache-miss penalty is effectively hidden. Of course, if too many instructions are inserted and the cache is small, the prefetched data may be evicted again before it is used, so the distance has to be evaluated for your own system. -> cache optimization
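The scheduling idea above can be sketched in C using GCC/Clang's `__builtin_prefetch` builtin (the compiler maps it to the target's prefetch instruction where one exists, e.g. PLD on ARMv5TE, and drops it otherwise). The 64-byte prefetch distance here is an arbitrary placeholder that would need tuning against the actual miss latency:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum a byte buffer, prefetching the data that will be needed a few
 * iterations ahead so the fetch overlaps the arithmetic on the current
 * data instead of stalling it. */
static void sum_with_prefetch(const uint8_t *src, size_t n, uint32_t *sum)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&src[i + 64], 0 /* read */, 0 /* low locality */);
        s += src[i];  /* "useful" instructions that hide the fetch */
    }
    *sum = s;
}
```

The result is identical to a plain loop; only the memory-latency behavior differs.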
4. Summary
The IDCT is the first of the key steps in video decoding. It is generally implemented with a fast algorithm such as Chen-Wang; the performance difference between a C implementation and an assembly one is considerable.
Performing the IDCT on an 8x8 block looks roughly like this:

    for (i = 0; i < 8; i++)
        idct_row(block + 8 * i);
    for (i = 0; i < 8; i++)
        idct_col(block + i);
Rewritten in assembly, this reduces memory bandwidth, improves storage efficiency, and avoids unnecessary memory reads and writes.
MPlayer has put a lot of effort in here; the ARMv4-related code (the S3C2440 is an ARMv4L part) lives in the file dsputil_arm_s.s. Unfortunately, it contains the PLD instruction, a cache-prefetch instruction the 2440 does not support: PLD belongs to the enhanced-DSP extensions and is only available from ARMv5TE onwards (E for enhanced DSP). The code running on our board therefore has to comment this instruction out, or it will not assemble.
Back to the topic. Before the IDCT, the compressed video bitstream passes through the VLD (variable-length decoding) stage to recover the DCT coefficients.
This work is generally accelerated with lookup tables: all the code tables are precomputed and stored in advance. The code that pulls bits from the video bitstream is usually written as macros, so that macro expansion achieves roughly the same effect as assembly.
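The table-lookup VLD can be sketched as follows. The bit-reader helpers and the 2-bit code table below are invented for illustration and correspond to no real MPEG-4 table; real decoders use the same peek/consume pattern, often as macros, with much larger tables:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal big-endian (MSB-first) bit reader. */
typedef struct { const uint8_t *buf; uint32_t bitpos; } BitReader;

static uint32_t show_bits(const BitReader *br, int n)  /* peek n bits */
{
    uint32_t v = 0;
    for (int i = 0; i < n; i++) {
        uint32_t p = br->bitpos + i;
        v = (v << 1) | ((br->buf[p >> 3] >> (7 - (p & 7))) & 1);
    }
    return v;
}

static void flush_bits(BitReader *br, int n) { br->bitpos += n; }

/* Toy VLC table indexed by the next 2 bits: {decoded symbol, code length}.
 * Code "0" -> symbol 0 (1 bit); "10" -> symbol 1; "11" -> symbol 2. */
static const struct { int symbol, len; } vlc_tab[4] = {
    {0, 1}, {0, 1},
    {1, 2},
    {2, 2},
};

static int vlc_get(BitReader *br)
{
    uint32_t idx = show_bits(br, 2);   /* peek enough bits to index the table */
    flush_bits(br, vlc_tab[idx].len);  /* consume only the actual code length */
    return vlc_tab[idx].symbol;
}
```

Peeking a fixed number of bits and consuming a variable count per table entry is what lets one lookup replace a bit-by-bit tree walk.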
After the IDCT, the two remaining key steps are motion compensation and color-space conversion. Motion compensation is also accelerated with assembly, and its code likewise lives in dsputil_arm_s.s. If SIMD instructions are available, they speed this part up greatly.
Color-space conversion is the key step after decoded output. Embedded systems usually use RGB565, representing one pixel color in 16 bits.
For an 8x8 block, its YUV data (4:2:0 format) is laid out as follows:
yyyyyyyy
yyyyyyyy
yyyyyyyy
yyyyyyyy
uuuuuuuu
vvvvvvvv
Note that each sample is 8 bits. Pixel values are obtained by evaluating the conversion equations; in practice a lookup table is used to accelerate the computation, with a corresponding table for each of Y, U and V. A 320x240 video has 76800 pixels in total; if each pixel saves 10 cycles in this conversion, the total CPU time saved is quite substantial.
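One table-driven variant of this conversion can be sketched in C. The BT.601 coefficients are standard, but the fixed-point scaling (x256) and the table layout (one table per chroma contribution, a slight variation on strictly one table per Y/U/V) are choices made here for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed chroma contributions: 1.402*(V-128), 0.344*(U-128),
 * 0.714*(V-128), 1.772*(U-128), all scaled by 256 for fixed point. */
static int tab_rv[256], tab_gu[256], tab_gv[256], tab_bu[256];

static void init_tables(void)
{
    for (int i = 0; i < 256; i++) {
        int d = i - 128;
        tab_rv[i] = (359 * d) >> 8;   /* 1.402 * 256 ~= 359 */
        tab_gu[i] = (88  * d) >> 8;   /* 0.344 * 256 ~= 88  */
        tab_gv[i] = (183 * d) >> 8;   /* 0.714 * 256 ~= 183 */
        tab_bu[i] = (454 * d) >> 8;   /* 1.772 * 256 ~= 454 */
    }
}

static int clip255(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Per-pixel work after table setup: adds, clips and shifts only. */
static uint16_t yuv_to_rgb565(int y, int u, int v)
{
    int r = clip255(y + tab_rv[v]);
    int g = clip255(y - tab_gu[u] - tab_gv[v]);
    int b = clip255(y + tab_bu[u]);
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

The multiplications are paid once at table initialization rather than once per pixel, which is where the cycles-per-pixel saving comes from.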
After color-space conversion, the picture is copied into the framebuffer memory, which costs a large copy. Two things are worth noting here.
First, some implementations render the converted output directly into the framebuffer, eliminating the final copy. This is a genuinely good idea, but it takes some skill to implement.
Second, the copy itself can be accelerated. On ARMv5 and later architectures the path is CPU ---- cache ---- memory, where the cache-to-memory width is 32 bits but the CPU-to-cache bus is 64 bits wide, so a 64-bit access can be had for the cost of a 32-bit one. Exploiting this can in theory double the copy speed.
On a PC, applications often have a fast memory-copy routine implemented with special instructions such as SIMD; on ARMv5 the acceleration comes from this bus width.
It is unavailable on the S3C2440 :( which is a V4 part.
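The wide-access idea can be sketched as a copy routine that moves 64 bits per iteration. It assumes 8-byte-aligned buffers; whether the compiler actually emits LDRD/STRD pairs for the `uint64_t` accesses depends on the target (on an ARMv4 core like the S3C2440 it simply falls back to pairs of 32-bit moves):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 8 bytes per iteration, with a byte tail.
 * dst and src are assumed 8-byte aligned. */
static void copy64(void *dst, const void *src, size_t n)
{
    uint64_t *d = (uint64_t *)dst;
    const uint64_t *s = (const uint64_t *)src;
    size_t i, words = n / 8;
    for (i = 0; i < words; i++)
        d[i] = s[i];                        /* one 64-bit move */
    memcpy((uint8_t *)dst + words * 8,      /* remaining 0-7 bytes */
           (const uint8_t *)src + words * 8, n % 8);
}
```

In production code one would also handle misaligned heads and possibly unroll the loop; this sketch only shows the width trick.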
In general:
(1) At the algorithm level there is little left to gain; FFmpeg/MPlayer are already quite well implemented, unless you write a new decoder yourself.
(2) At the code level, acceleration comes mainly from inlining (macros, inline functions) and hand-written assembly of the key code. On the ARM platform this still has some potential to be tapped.
(3) At the hardware level, the CPU architecture determines the instruction set and the cache geometry and size. Whether the instruction set has enhanced-DSP or SIMD instructions, whether the cache is configurable, and the cache-line size all affect code-level and algorithm-level optimization.
(4) System-level optimization comes last because it builds on the entire system; it can only be done with a deep understanding of the whole stack, hardware and software alike.
Throughout the optimization, the essence is to remove redundant computation as far as possible and make maximal use of the system's hardware resources.
For a RISC-architecture CPU, the inherent disadvantage is a relatively large demand on memory bandwidth: its instructions operate on registers, so operands must first be loaded from memory into registers, and too many CPU cycles end up spent on memory reads and writes.
The following code, taken from the decoded-output stage (YUV-to-RGB conversion), serves as an example:
0000111c:
111c: e92d4ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
1120: e1a0a000 mov sl, r0
1124: e5900038 ldr r0, [r0, #56]
1128: e1a0c001 mov ip, r1
112c: e3500004 cmp r0, #4 ; 0x4
1130: e24dd034 sub sp, sp, #52 ; 0x34
1134: e1a00002 mov r0, r2
1138: e1a01003 mov r1, r3
113c: 0a00055d beq 157c
1140: e59d2058 ldr r2, [sp, #88]
1144: e3520000 cmp r2, #0 ; 0x0
1148: d1a00002 movle r0, r2
114c: da00055b ble 1574
1150: e59d3060 ldr r3, [sp, #96]
1154: e58d1030 str r1, [sp, #48]
1158: e5933000 ldr r3, [r3]
115c: e59f2434 ldr r2, [pc, #1076] ; 1598 <.text+0x1598>
1160: e0213193 mla r1, r3, r1, r3
1164: e58d3018 str r3, [sp, #24]
1168: e59d305c ldr r3, [sp, #92]
116c: e58d1000 str r1, [sp]
1170: e5933000 ldr r3, [r3]
1174: e79a1002 ldr r1, [sl, r2]
1178: e58d301c str r3, [sp, #28]
117c: e5903008 ldr r3, [r0, #8]
1180: e59c4008 ldr r4, [ip, #8]
1184: e590e000 ldr lr, [r0]
1188: e59c2000 ldr r2, [ip]
118c: e5900004 ldr r0, [r0, #4]
1190: e59cc004 ldr ip, [ip, #4]
1194: e58d3014 str r3, [sp, #20]
1198: e1a011c1 mov r1, r1, asr #3 ; h_size
119c: e3a03000 mov r3, #0 ; 0x0
11a0: e58d4010 str r4, [sp, #16]
11a4: e58d2004 str r2, [sp, #4]
11a8: e58de020 str lr, [sp, #32]
11ac: e58d000c str r0, [sp, #12]
11b0: e58dc008 str ip, [sp, #8]
11b4: e58d1028 str r1, [sp, #40]
11b8: e58d3024 str r3, [sp, #36]
11bc: e1a08003 mov r8, r3
......
We can see that this segment contains far too many LDR (load, read from memory) and STR (store, write to memory) instructions.
Moreover, excessive loads and stores hurt cache efficiency between the CPU and memory and cause cache thrashing: on a cache miss, the cache controller works hard to move a line from memory into the cache, only to have it evicted again before it is reused. With bad luck, the cache thrashes continuously.
In the decoding flow, each module works in isolation, and each claims a sizeable share of memory bandwidth.
How can this waste be reduced? The key code must be adapted to the hardware architecture, and code along the same data stream must be coupled together.
Modularized code offers excellent readability and extensibility, but you cannot have it both ways: coupled code inevitably becomes harder to read.
FFmpeg/MPlayer strikes a good trade-off here.
Sections 1, 2 and 3 above were gathered from the Internet. Some background in video coding and decoding is needed to fully understand the material.